Re: [openstack-dev] [TripleO][CI] Ability to reproduce failures

2016-04-13 Thread Steven Hardy
On Tue, Apr 12, 2016 at 11:08:28PM +0200, Gabriele Cerami wrote:
> On Fri, 2016-04-08 at 16:18 +0100, Steven Hardy wrote:
> 
> > Note we're not using devtest at all anymore, the developer script
> > many
> > folks use is tripleo.sh:
> 
> So, I followed the flow of the gate jobs starting from the Jenkins builder
> script, and it seems like it's using devtest (or maybe something I
> consider to be devtest but isn't; is devtest the part that creates
> some environments, waits for them to be locked by gearman, and so on?)

So I think the confusion may stem from the fact that ./docs/TripleO-ci.rst is
out of date.  Derek can confirm, but I think although there may be a few
residual devtest pieces associated with managing the testenv VMs, there's
nothing related to devtest used in the actual CI run itself anymore.

See this commit:

https://github.com/openstack-infra/tripleo-ci/commit/a85deb848007f0860ac32ac0096c5e45fe899cc5

Since then we've moved to using tripleo.sh to drive most steps of the CI
run, and many developers are using it also.  Previously the same was true
of the devtest.sh script in tripleo-incubator, but now that is totally
deprecated and unused (that it still exists in the repo is an oversight).

> What I meant by "the script I'm using (created by Sagi) is not
> creating the same environment" is that it is not using the same test env
> (with gearman and such) that the CI scripts are currently using.

Sure, I guess my point is that for 99% of issues, the method used to create
the VMs is not important.  We use a slightly different method in CI to
manage the VMs than in most developer environments, but if the requirement
is to reproduce CI failures, you mostly care about deploying the exact same
software, not so much about how virsh was driven to create the VMs.
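
To make that concrete, a rough sketch of what "the exact same software" means
in practice (the URL and hash below are placeholders, not real values - take
them from the failed job's logs):

# Point the local environment at the same delorean repo the CI job used,
# so the same package versions get installed.
DELOREAN_REPO="https://trunk.rdoproject.org/centos7/<hash-from-job-logs>"
sudo curl -Lo /etc/yum.repos.d/delorean.repo "$DELOREAN_REPO/delorean.repo"
sudo yum clean all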

Thanks for digging into this, it's great to have some fresh eyes
highlighting these sorts of issues! :)

Steve

__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [TripleO][CI] Ability to reproduce failures

2016-04-12 Thread Gabriele Cerami
On Fri, 2016-04-08 at 16:18 +0100, Steven Hardy wrote:

> Note we're not using devtest at all anymore, the developer script
> many
> folks use is tripleo.sh:

So, I followed the flow of the gate jobs starting from the Jenkins builder
script, and it seems like it's using devtest (or maybe something I
consider to be devtest but isn't; is devtest the part that creates
some environments, waits for them to be locked by gearman, and so on?)

What I meant by "the script I'm using (created by Sagi) is not
creating the same environment" is that it is not using the same test env
(with gearman and such) that the CI scripts are currently using.

I'm trying to gather all the information I find in this etherpad:

https://etherpad.openstack.org/p/tripleo-ci-onboarding

If someone could review it, it might help me now, and others that wish
to join the effort later.

Thanks.

__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [TripleO] [CI] Tempest configuration in Tripleo CI jobs

2016-04-11 Thread Sagi Shnaidman
Hi, Andrey

I've checked this option - using rally for configuring and running tempest
tests.
Although it looks like a great choice, unfortunately a few issues and bugs
make it not usable right now. For example, it cannot work with the current
public networks and cannot create new ones, so everything that is
related to networking will fail. As I understand it, this bug has remained
unsolved for a long time: https://bugs.launchpad.net/rally/+bug/1550848
Also, it has no way to customize configuration options when running the
tempest configuration step, as configure_tempest.py does by simply listing
them on the command line. With rally you need to generate the tempest file
and then edit it manually to customize it (for example, the tempest log path
in the DEFAULT section). Adding such an interface for tempest configuration
would be a great feature for rally, IMHO.
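
To illustrate the difference (subcommand names from memory, so please
double-check them against "rally verify --help" and "python
config_tempest.py --help" before relying on them):

# config_tempest.py takes the overrides straight on the command line:
python config_tempest.py --out etc/tempest.conf --create \
    identity.uri $OS_AUTH_URL identity.admin_password $OS_PASSWORD

# With rally you generate the config first and then patch the file yourself,
# e.g. with crudini (the path depends on where rally writes the generated file):
rally verify genconfig
crudini --set <path-to-generated>/tempest.conf DEFAULT log_file tempest.log
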
I think it's a cool approach and we should definitely take it into account,
but right now it looks pretty raw and not stable enough to use in gate jobs.
Anyway, thank you for pointing out this great tool.

Thanks

On Fri, Apr 8, 2016 at 2:33 PM, Andrey Kurilin 
wrote:

> Hi Sagi,
>
>
> On Thu, Apr 7, 2016 at 5:56 PM, Sagi Shnaidman 
> wrote:
>
>> Hi, all
>>
>> I'd like to discuss how we configure tempest in CI
>> jobs for TripleO.
>> I have currently two patches:
>> support for tempest: https://review.openstack.org/#/c/295844/
>> actually run of tests: https://review.openstack.org/#/c/297038/
>>
>> Right now there is no upstream tool to configure tempest, so everybody
>> uses their own tools.
>>
>
> You are wrong. There is Rally in upstream:)
> Basic and the most widely used Rally component is Task, which provides
> benchmarking and testing tool.
> But, also, Rally has Verification component(here
> 
> you can find is a bit outdated blog-post, but it can introduce Verification
> component for you).
> It can:
>
> 1. Configure Tempest based on the public OpenStack API.
> An example of a config from our gates:
> http://logs.openstack.org/58/285758/5/check/gate-rally-dsvm-verify-full/eabe2ff/rally-verify/5_verify_showconfig.txt.gz
> (Empty options mean that rally will check for these resources while running
> tempest and create them if necessary.)
>
> 2. Launch set of tests, tests which match regexp, list of tests. Also, it
> supports x-fail mechanism from out of box.
> An example of full run based on config file posted above -
> http://logs.openstack.org/58/285758/5/check/gate-rally-dsvm-verify-full/eabe2ff/rally-verify/7_verify_results.html.gz
>
> 3. Compare results.
>
> http://logs.openstack.org/58/285758/5/check/gate-rally-dsvm-verify-light/d806b91/rally-verify/17_verify_compare_--uuid-1_9fe72ea8-bd5c-45eb-9a37-5e674ea5e5d4_--uuid-2_315843d4-40b8-46f2-aa69-fb3d5d463379.html.gz
> It is not so good-looking as other rally reports, but we will fix it
> someday:)
>
> To summarize:
> - Rally is an upstream tool, which was accepted into the Big Tent.
> - One instance of Rally can manage and run tempest for any number of
> clouds.
> - The Rally Verification component is tested in the gates for every new patch.
> It also supports different APIs of services.
> - You can install, configure, launch, store results, and display results in
> different formats.
>
> Btw, we are planning to refactor verification component(there is an spec
> on review with several +2), so you will be able to launch whatever you want
> subunit-based tools via Rally and simplify usage of it.
>
>> However, one is planned, and David Mellado is working on it AFAIK.
>>
>> Until then everybody uses their own tools for tempest configuration.
>> I'd review two of them:
>> 1) Puppet configurations that is used in puppet modules CI
>> 2) Using configure_tempest.py script from
>> https://github.com/redhat-openstack/tempest/blob/master/tools/config_tempest.py
>>
>> Unfortunately there is no ready-made puppet module or script that configures
>> tempest; you need to create your own.
>>
>> On the other hand, the config_tempest.py script provides full configuration,
>> support for tempest-deployer-input.conf, and the possibility to add any config
>> options on the command line when running it:
>>
>> python config_tempest.py \
>> --out etc/tempest.conf \
>> --debug \
>> --create \
>> --deployer-input ~/tempest-deployer-input.conf \
>> identity.uri $OS_AUTH_URL \
>> compute.allow_tenant_isolation true \
>> identity.admin_password $OS_PASSWORD \
>> compute.build_timeout 500 \
>> compute.image_ssh_user cirros
>>
>> It also uploads images, creates the necessary roles, etc. The only thing it
>> requires is an existing public network.
>> So finally, the whole tempest configuration from scratch looks like this:
>>
>> CONFIGURE_TEMPEST_DIR="$(ls
>> /usr/share/openstack-tempest-*/tools/configure-tempest-directory)"
>> $CONFIGURE_TEMPEST_DIR
>> neutron net-create nova --shared 

Re: [openstack-dev] [TripleO][CI] Ability to reproduce failures

2016-04-08 Thread Derek Higgins
On 7 April 2016 at 22:03, Gabriele Cerami  wrote:
> Hi,
>
> I'm trying to find an entry point to join the effort in TripleO CI.
Hi Gabriele, welcome aboard

> I studied the infrastructure and the scripts, but there's still something I'm 
> missing.
> The last step of studying the complex landscape of TripleO CI and the first 
> to start contributing
> is being able to reproduce failures in an accessible environment, to start 
> debugging issues.
> I have not found an easy and stable way to do this. Jobs are certainly 
> gathering
> a lot of logs, but that's not enough.
>
> At the moment, I started launching periodic jobs on my local test box using
> this script
> https://github.com/sshnaidm/various/blob/master/tripleo_repr.sh
>
> It's quite handy, but I'm not sure it's able to produce perfectly compatible
> environments with what's in CI.

Great, I haven't tried to run it, but at a quick glance it looks like
you're doing most of the main steps that are needed to mimic CI, and I haven't
seen anything that is obviously missing. What kind of differences are
you seeing in the results when compared to CI?

>
> Can anyone suggest a way to make jobs reproducible locally? I know it may be 
> complicated
> to set up an environment through devtest, but maybe if we can start with just a 
> list of steps,
> then it would be easier to put them into a script, then make it available in 
> the log in place
> of the current reproduce.sh, which is not very useful.

So, I think the problem with reproduce.sh is that nobody in tripleo
has ever used it, and as a result toci_gate_test.sh and
toci_instack.sh just aren't compatible with it. I'd suggest we change
the toci_* scripts so that they play nicely together. I'll see if I
can give it a whirl over the next few days and see what problems we're
likely to hit.

>
> thanks for any feedback.
>
> __
> OpenStack Development Mailing List (not for usage questions)
> Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
> http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev

__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [TripleO][CI] Ability to reproduce failures

2016-04-08 Thread Steven Hardy
Hi Gabriele,

On Thu, Apr 07, 2016 at 05:03:33PM -0400, Gabriele Cerami wrote:
> Hi,
> 
> I'm trying to find an entry point to join the effort in TripleO CI.
> I studied the infrastructure and the scripts, but there's still something I'm 
> missing.
> The last step of studying the complex landscape of TripleO CI and the first 
> to start contributing
> is being able to reproduce failures in an accessible environment, to start 
> debugging issues.
> I have not found an easy and stable way to do this. Jobs are certainly 
> gathering
> a lot of logs, but that's not enough.
> 
> At the moment, I started launching periodic jobs on my local test box using
> this script
> https://github.com/sshnaidm/various/blob/master/tripleo_repr.sh
> 
> It's quite handy, but I'm not sure it's able to produce perfectly compatible 
> environments with what's in CI.
> 
> Can anyone suggest a way to make jobs reproducible locally? I know it may be 
> complicated
> to set up an environment through devtest, but maybe if we can start with just a 
> list of steps, 
> then it would be easier to put them into a script, then make it available in 
> the log in place
> of the current reproduce.sh, which is not very useful.

Note we're not using devtest at all anymore, the developer script many
folks use is tripleo.sh:

https://github.com/openstack-infra/tripleo-ci/blob/master/scripts/tripleo.sh

Your script is already calling this so you're pretty close to what others
are using I think.
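
For reference, the sequence of steps CI drives through tripleo.sh looks
roughly like this (flag names from memory, so check ./tripleo.sh --help for
the authoritative list):

# Rough sketch of the usual flow, run on the instack/undercloud VM:
./tripleo.sh --repo-setup         # set up the delorean/current repos
./tripleo.sh --undercloud         # install the undercloud
./tripleo.sh --overcloud-images   # build and upload the overcloud images
./tripleo.sh --register-nodes     # register the virtual baremetal nodes
./tripleo.sh --introspect-nodes   # run introspection on them
./tripleo.sh --overcloud-deploy   # deploy the overcloud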

There is an effort in https://github.com/openstack/tripleo-quickstart to
make this process more automated, but my setup process looks like this
http://paste.fedoraproject.org/351818/1283361/ (note the RAM allocation
there is more than is typically required)

In both cases you end up using a different process to set up the instack VM,
but it otherwise gets you very close to what CI runs and generally reproduces
issues.

Note that once you have an environment, you can often just yum update the
undercloud (and possibly re-run openstack undercloud install) to reproduce
issues (sometimes you'll also have to rebuild the overcloud-full image) -
so when I see an issue in CI, typically my first step is incrementally
updating things on my local environment rather than completely building a
new one from scratch (which obviously takes much longer).
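
In practice that incremental update is usually just something like the
following (a sketch, not exact commands - adjust to your environment):

# On the undercloud VM: pull in the latest packages and re-apply the
# undercloud configuration.
sudo yum -y update
openstack undercloud install

# If the problem is in the overcloud image contents, rebuild and re-upload
# the overcloud-full image (e.g. via tripleo.sh --overcloud-images), then
# re-run the overcloud deploy.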

I agree we could make this easier, and it'd be good if those scripts other
than tripleo.sh were more easily reusable in developer environments.

Steve

__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [TripleO] [CI] Tempest configuration in Tripleo CI jobs

2016-04-08 Thread Andrey Kurilin
Hi Sagi,


On Thu, Apr 7, 2016 at 5:56 PM, Sagi Shnaidman  wrote:

> Hi, all
>
> I'd like to discuss how we configure tempest in CI jobs
> for TripleO.
> I have currently two patches:
> support for tempest: https://review.openstack.org/#/c/295844/
> actually run of tests: https://review.openstack.org/#/c/297038/
>
> Right now there is no upstream tool to configure tempest, so everybody uses
> their own tools.
>

You are wrong. There is Rally upstream :)
The basic and most widely used Rally component is Task, which provides a
benchmarking and testing tool.
But Rally also has a Verification component (here

you can find a slightly outdated blog post, but it can introduce the
Verification component for you).
It can:

1. Configure Tempest based on the public OpenStack API.
An example of a config from our gates:
http://logs.openstack.org/58/285758/5/check/gate-rally-dsvm-verify-full/eabe2ff/rally-verify/5_verify_showconfig.txt.gz
(Empty options mean that rally will check for these resources while running
tempest and create them if necessary.)

2. Launch a set of tests, tests matching a regexp, or a list of tests. Also, it
supports an x-fail mechanism out of the box.
An example of a full run based on the config file posted above:
http://logs.openstack.org/58/285758/5/check/gate-rally-dsvm-verify-full/eabe2ff/rally-verify/7_verify_results.html.gz

3. Compare results.
http://logs.openstack.org/58/285758/5/check/gate-rally-dsvm-verify-light/d806b91/rally-verify/17_verify_compare_--uuid-1_9fe72ea8-bd5c-45eb-9a37-5e674ea5e5d4_--uuid-2_315843d4-40b8-46f2-aa69-fb3d5d463379.html.gz
It is not so good-looking as other rally reports, but we will fix it
someday:)

To summarize:
- Rally is an upstream tool, which was accepted into the Big Tent.
- One instance of Rally can manage and run tempest for any number of
clouds.
- The Rally Verification component is tested in the gates for every new patch. It
also supports different APIs of services.
- You can install, configure, launch, store results, and display results in
different formats.
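
To give a feel for the flow, here is a rough sketch of the commands (partly
from memory and partly from the step names in the logs above, so treat it as
approximate):

# Install tempest for the current deployment and generate/inspect its config.
rally verify install
rally verify genconfig
rally verify showconfig

# Run everything, or only the tests matching a regexp.
rally verify start
rally verify start --regex tempest.api.network

# Render the results and compare two runs.
rally verify results --html > verify_results.html
rally verify compare --uuid-1 <first-run-uuid> --uuid-2 <second-run-uuid>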

Btw, we are planning to refactor the verification component (there is a spec on
review with several +2s), so you will be able to launch whatever
subunit-based tools you want via Rally, and its usage will be simplified.

> However, one is planned, and David Mellado is working on it AFAIK.
>
> Until then everybody uses their own tools for tempest configuration.
> I'd like to review two of them:
> 1) The puppet configuration that is used in the puppet modules CI
> 2) Using the config_tempest.py script from
> https://github.com/redhat-openstack/tempest/blob/master/tools/config_tempest.py
>
> Unfortunately there is no ready-made puppet module or script that configures
> tempest; you need to create your own.
>
> On the other hand, the config_tempest.py script provides full configuration,
> support for tempest-deployer-input.conf, and the possibility to add any config
> options on the command line when running it:
>
> python config_tempest.py \
> --out etc/tempest.conf \
> --debug \
> --create \
> --deployer-input ~/tempest-deployer-input.conf \
> identity.uri $OS_AUTH_URL \
> compute.allow_tenant_isolation true \
> identity.admin_password $OS_PASSWORD \
> compute.build_timeout 500 \
> compute.image_ssh_user cirros
>
> It also uploads images, creates the necessary roles, etc. The only thing it
> requires is an existing public network.
> So finally, the whole tempest configuration from scratch looks like this:
>
> CONFIGURE_TEMPEST_DIR="$(ls
> /usr/share/openstack-tempest-*/tools/configure-tempest-directory)"
> $CONFIGURE_TEMPEST_DIR
> neutron net-create nova --shared --router:external=True
> --provider:network_type flat --provider:physical_network datacentre;
> neutron subnet-create --name ext-subnet --allocation-pool
> start=$FLOATING_IP_START,end=$FLOATING_IP_END --disable-dhcp --gateway
> $EXTERNAL_NETWORK_GATEWAY nova $FLOATING_IP_CIDR;
> python tempest/tools/install_venv.py
> python config_tempest.py \
> --out etc/tempest.conf \
> --debug \
> --create \
> --deployer-input ~/tempest-deployer-input.conf \
> identity.uri $OS_AUTH_URL \
> compute.allow_tenant_isolation true \
> identity.admin_password $OS_PASSWORD \
> compute.build_timeout 500 \
> compute.image_ssh_user cirros
> testr init; testr run
>
> In my patch [1] I have proposed it with little changes to TripleO CI repo
> [2]
>
> As I wrote before, there is an option to use puppet for this. I spent some
> time investigating how to do it and would like to share the results with you
> in order to compare it with the config_tempest.py approach.
>
> First of all, it's surprising that puppet-tempest actually doesn't know how to
> do much of anything. All it knows how to do is set the IDs of the public
> network (but not the router) and images. That's all. Everything else you need
> to configure manually.
> Then comes another problem - 

[openstack-dev] [TripleO][CI] Ability to reproduce failures

2016-04-07 Thread Gabriele Cerami
Hi,

I'm trying to find an entry point to join the effort in TripleO CI.
I studied the infrastructure and the scripts, but there's still something I'm 
missing.
The last step of studying the complex landscape of TripleO CI and the first to 
start contributing
is being able to reproduce failures in an accessible environment, to start 
debugging issues.
I have not found an easy and stable way to do this. Jobs are certainly gathering
a lot of logs, but that's not enough.

At the moment, I started launching periodic jobs on my local test box using
this script
https://github.com/sshnaidm/various/blob/master/tripleo_repr.sh

It's quite handy, but I'm not sure it's able to produce perfectly compatible 
environments with what's in CI.

Can anyone suggest a way to make jobs reproducible locally? I know it may be 
complicated
to set up an environment through devtest, but maybe if we can start with just a 
list of steps, 
then it would be easier to put them into a script, then make it available in the 
log in place
of the current reproduce.sh, which is not very useful.

thanks for any feedback.

__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


[openstack-dev] [TripleO] [CI] Tempest configuration in Tripleo CI jobs

2016-04-07 Thread Sagi Shnaidman
Hi, all

I'd like to discuss how we configure tempest in CI jobs
for TripleO.
I currently have two patches:
support for tempest: https://review.openstack.org/#/c/295844/
actually running the tests: https://review.openstack.org/#/c/297038/

Right now there is no upstream tool to configure tempest, so everybody uses
their own tools. However, one is planned, and David Mellado is working on it
AFAIK.
Until then everybody uses their own tools for tempest configuration.
I'd like to review two of them:
1) The puppet configuration that is used in the puppet modules CI
2) Using the config_tempest.py script from
https://github.com/redhat-openstack/tempest/blob/master/tools/config_tempest.py

Unfortunately there is no ready-made puppet module or script that configures
tempest; you need to create your own.

On the other hand, the config_tempest.py script provides full configuration,
support for tempest-deployer-input.conf, and the possibility to add any config
options on the command line when running it:

python config_tempest.py \
--out etc/tempest.conf \
--debug \
--create \
--deployer-input ~/tempest-deployer-input.conf \
identity.uri $OS_AUTH_URL \
compute.allow_tenant_isolation true \
identity.admin_password $OS_PASSWORD \
compute.build_timeout 500 \
compute.image_ssh_user cirros

It also uploads images, creates the necessary roles, etc. The only thing it
requires is an existing public network.
So finally, the whole tempest configuration from scratch looks like this:

CONFIGURE_TEMPEST_DIR="$(ls
/usr/share/openstack-tempest-*/tools/configure-tempest-directory)"
$CONFIGURE_TEMPEST_DIR
neutron net-create nova --shared --router:external=True
--provider:network_type flat --provider:physical_network datacentre;
neutron subnet-create --name ext-subnet --allocation-pool
start=$FLOATING_IP_START,end=$FLOATING_IP_END --disable-dhcp --gateway
$EXTERNAL_NETWORK_GATEWAY nova $FLOATING_IP_CIDR;
python tempest/tools/install_venv.py
python config_tempest.py \
--out etc/tempest.conf \
--debug \
--create \
--deployer-input ~/tempest-deployer-input.conf \
identity.uri $OS_AUTH_URL \
compute.allow_tenant_isolation true \
identity.admin_password $OS_PASSWORD \
compute.build_timeout 500 \
compute.image_ssh_user cirros
testr init; testr run

In my patch [1] I have proposed it with little changes to TripleO CI repo
[2]

As I wrote before, there is an option to use puppet for this. I spent some
time investigating how to do it and would like to share the results with you
in order to compare it with the config_tempest.py approach.

First of all, it's surprising that puppet-tempest actually doesn't know how to
do much of anything. All it knows how to do is set the IDs of the public
network (but not the router) and images. That's all. Everything else you need
to configure manually.
Then comes another problem: you can use it only on an overcloud controller
node, where all the service configurations and hiera data are. Most of the
values are taken directly from /etc/{service}/service.conf files, so running
it on the undercloud would configure the undercloud itself (instead of the
overcloud).

So first of all you need to upload this manifest to a controller node of the
overcloud.
Let's write this puppet manifest. I wrote everything in one file to save
time, but of course it should be a module with the usual puppet module
structure: module_name/manifests/init.pp with a module_name class.

Manual configurations:

class testt::config {
  $os_username = 'admin'
  $os_tenant_name = hiera(keystone::roles::admin::admin_tenant)
  $os_password = hiera(admin_password)
  $os_auth_url = hiera(keystone::endpoint::public_url)
  $keystone_auth_uri = regsubst($os_auth_url, '/v2.0', '')
  $floating_range   = "192.0.2.0/24"
  $gateway_ip   = "192.0.2.1"
  $floating_pool= 'start=192.0.2.50,end=192.0.2.99'
  $fixed_range  = '10.0.0.0/24'
  $router_name  = 'router1'
  $ca_bundle_cert_path = '/etc/ssl/certs/ca-bundle.crt'
  $cert_path   =
'/etc/pki/ca-trust/source/anchors/puppet_openstack.pem'
  $update_ca_certs_cmd = '/usr/bin/update-ca-trust force-enable &&
/usr/bin/update-ca-trust extract'
  $host_url = regsubst($keystone_auth_uri, ':5000', '')
}

Most of the data is taken from hiera on the controller host (/etc/hieradata).
Then we actually start the tempest configuration. Surprisingly, it doesn't
have a resource type to work with flavors, so all of that configuration is done
with "exec"s. We run puppet with bash to run bash within puppet, which adds
pretty big overhead.

class testt::provision {
  include testt::config

  $os_auth_options = "--os-username ${config::os_username} --os-password
${config::os_password} --os-tenant-name ${config::os_tenant_name}
--os-auth-url ${config::os_auth_url}/v2.0"

  exec { 'manage_m1.nano_nova_flavor':
path => '/usr/bin:/bin:/usr/sbin:/sbin',
provider => shell,
command  => "nova ${os_auth_options} flavor-delete m1.nano ||: ; nova

Re: [openstack-dev] [tripleo] CI jobs failures

2016-03-09 Thread Dan Prince
On Tue, 2016-03-08 at 17:58 +, Derek Higgins wrote:
> On 7 March 2016 at 18:22, Ben Nemec  wrote:
> > 
> > On 03/07/2016 11:33 AM, Derek Higgins wrote:
> > > 
> > > On 7 March 2016 at 15:24, Derek Higgins 
> > > wrote:
> > > > 
> > > > On 6 March 2016 at 16:58, James Slagle 
> > > > wrote:
> > > > > 
> > > > > On Sat, Mar 5, 2016 at 11:15 AM, Emilien Macchi  > > > > at.com> wrote:
> > > > > > 
> > > > > > I'm kind of hijacking Dan's e-mail but I would like to
> > > > > > propose some
> > > > > > technical improvements to stop having so much CI failures.
> > > > > > 
> > > > > > 
> > > > > > 1/ Stop creating swap files. We don't have SSD, this is
> > > > > > IMHO a terrible
> > > > > > mistake to swap on files because we don't have enough RAM.
> > > > > > In my
> > > > > > experience, swapping on non-SSD disks is even worse than not
> > > > > > having
> > > > > > enough RAM. We should stop doing that I think.
> > > > > We have been relying on swap in tripleo-ci for a little
> > > > > while. While
> > > > > not ideal, it has been an effective way to at least be able
> > > > > to test
> > > > > what we've been testing given the amount of physical RAM that
> > > > > is
> > > > > available.
> > > > Ok, so I have a few points here, in places where I'm making
> > > > assumptions I'll try to point it out
> > > > 
> > > > o Yes I agree using swap should be avoided if at all possible
> > > > 
> > > > o We are currently looking into adding more RAM to our testenv
> > > > hosts,
> > > > at which point we can afford to be a little more liberal with
> > > > Memory
> > > > and this problem should become less of an issue, having said
> > > > that
> > > > 
> > > > o Even though using swap is bad, if we have some processes with
> > > > a
> > > > large Mem footprint that don't require constant access to a
> > > > portion of
> > > > the footprint swaping it out over the duration of the CI test
> > > > isn't as
> > > > expensive as it would suggest (assuming it doesn't need to be
> > > > swapped
> > > > back in and the kernel has selected good candidates to swap
> > > > out)
> > > > 
> > > > o The test envs that host the undercloud and overcloud nodes
> > > > have 64G
> > > > of RAM each, they each host 4 testenvs and each test env if
> > > > running a
> > > > HA job can use up to 21G of RAM so we have over committed
> > > > there, it
> > > > this is only a problem if a test env host gets 4 HA jobs that
> > > > are
> > > > started around the same time (and as a result a each have 4
> > > > overcloud
> > > > nodes running at the same time), to allow this to happen
> > > > without VM's
> > > > being killed by the OOM we've also enabled swap there. The
> > > > majority of
> > > > the time this swap isn't in use, only if all 4 testenvs are
> > > > being
> > > > simultaneously used and they are all running the second half of
> > > > a CI
> > > > test at the same time.
> > > > 
> > > > o The overcloud nodes are VM's running with a "unsafe" disk
> > > > caching
> > > > mechanism, this causes sync requests from guest to be ignored
> > > > and as a
> > > > result if the instances being hosted on these nodes are going
> > > > into
> > > > swap this swap will be cached on the host as long as RAM is
> > > > available.
> > > > i.e. swap being used in the undercloud or overcloud isn't being
> > > > synced
> > > > to the disk on the host unless it has to be.
> > > > 
> > > > o What I'd like us to avoid is simply bumping up the memory
> > > > every time
> > > > we hit a OOM error without at least
> > > >   1. Explaining why we need more memory all of a sudden
> > > >   2. Looking into a way we may be able to avoid simply bumping
> > > > the RAM
> > > > (at peak times we are memory constrained)
> > > > 
> > > > as an example, Lets take a look at the swap usage on the
> > > > undercloud of
> > > > a recent ci nonha job[1][2], These insances have 5G of RAM with
> > > > 2G or
> > > > swap enabled via a swapfile
> > > > the overcloud deploy started @22:07:46 and finished at
> > > > @22:28:06
> > > > 
> > > > In the graph you'll see a spike in memory being swapped out
> > > > around
> > > > 22:09, this corresponds almost exactly to when the overcloud
> > > > image is
> > > > being downloaded from swift[3], looking the top output at the
> > > > end of
> > > > the test you'll see that swift-proxy is using over 500M of
> > > > Mem[4].
> > > > 
> > > > I'd much prefer we spend time looking into why the swift proxy
> > > > is
> > > > using this much memory rather then blindly bump the memory
> > > > allocated
> > > > to the VM, perhaps we have something configured incorrectly or
> > > > we've
> > > > hit a bug in swift.
> > > > 
> > > > Having said all that we can bump the memory allocated to each
> > > > node but
> > > > we have to accept 1 of 2 possible consequences
> > > > 1. We'll end up using the swap on the testenv hosts more than
> > > > we
> > > > currently are 

Re: [openstack-dev] [tripleo] CI jobs failures

2016-03-09 Thread Derek Higgins
On 9 March 2016 at 07:08, Richard Su  wrote:
>
>
> On 03/08/2016 09:58 AM, Derek Higgins wrote:
>>
>> On 7 March 2016 at 18:22, Ben Nemec  wrote:
>>>
>>> On 03/07/2016 11:33 AM, Derek Higgins wrote:

 On 7 March 2016 at 15:24, Derek Higgins  wrote:
>
> On 6 March 2016 at 16:58, James Slagle  wrote:
>>
>> On Sat, Mar 5, 2016 at 11:15 AM, Emilien Macchi 
>> wrote:
>>>
>>> I'm kind of hijacking Dan's e-mail but I would like to propose some
>>> technical improvements to stop having so much CI failures.
>>>
>>>
>>> 1/ Stop creating swap files. We don't have SSD, this is IMHO a
>>> terrible
>>> mistake to swap on files because we don't have enough RAM. In my
>>> experience, swapping on non-SSD disks is even worse than not having
>>> enough RAM. We should stop doing that I think.
>>
>> We have been relying on swap in tripleo-ci for a little while. While
>> not ideal, it has been an effective way to at least be able to test
>> what we've been testing given the amount of physical RAM that is
>> available.
>
> Ok, so I have a few points here, in places where I'm making
> assumptions I'll try to point it out
>
> o Yes I agree using swap should be avoided if at all possible
>
> o We are currently looking into adding more RAM to our testenv hosts,
> at which point we can afford to be a little more liberal with Memory
> and this problem should become less of an issue, having said that
>
> o Even though using swap is bad, if we have some processes with a
> large Mem footprint that don't require constant access to a portion of
> the footprint swaping it out over the duration of the CI test isn't as
> expensive as it would suggest (assuming it doesn't need to be swapped
> back in and the kernel has selected good candidates to swap out)
>
> o The test envs that host the undercloud and overcloud nodes have 64G
> of RAM each, they each host 4 testenvs and each test env if running a
> HA job can use up to 21G of RAM so we have over committed there, it
> this is only a problem if a test env host gets 4 HA jobs that are
> started around the same time (and as a result a each have 4 overcloud
> nodes running at the same time), to allow this to happen without VM's
> being killed by the OOM we've also enabled swap there. The majority of
> the time this swap isn't in use, only if all 4 testenvs are being
> simultaneously used and they are all running the second half of a CI
> test at the same time.
>
> o The overcloud nodes are VM's running with a "unsafe" disk caching
> mechanism, this causes sync requests from guest to be ignored and as a
> result if the instances being hosted on these nodes are going into
> swap this swap will be cached on the host as long as RAM is available.
> i.e. swap being used in the undercloud or overcloud isn't being synced
> to the disk on the host unless it has to be.
>
> o What I'd like us to avoid is simply bumping up the memory every time
> we hit a OOM error without at least
>1. Explaining why we need more memory all of a sudden
>2. Looking into a way we may be able to avoid simply bumping the RAM
> (at peak times we are memory constrained)
>
> as an example, Lets take a look at the swap usage on the undercloud of
> a recent ci nonha job[1][2], These insances have 5G of RAM with 2G or
> swap enabled via a swapfile
> the overcloud deploy started @22:07:46 and finished at @22:28:06
>
> In the graph you'll see a spike in memory being swapped out around
> 22:09, this corresponds almost exactly to when the overcloud image is
> being downloaded from swift[3], looking the top output at the end of
> the test you'll see that swift-proxy is using over 500M of Mem[4].
>
> I'd much prefer we spend time looking into why the swift proxy is
> using this much memory rather then blindly bump the memory allocated
> to the VM, perhaps we have something configured incorrectly or we've
> hit a bug in swift.
>
> Having said all that we can bump the memory allocated to each node but
> we have to accept 1 of 2 possible consequences
> 1. We'll end up using the swap on the testenv hosts more than we
> currently are or
> 2. We'll have to reduce the number of test envs per host from 4 down
> to 3, wiping 25% of our capacity

 Thinking about this a little more, we could do a radical experiment
 for a week and just do this, i.e. bump up the RAM on each env and
 accept we lose 25% of our capacity, maybe it doesn't matter, if our
 success rate goes up then we'd be running less rechecks anyways.
 The downside is that we'd probably hit less timing errors (assuming
 the 

Re: [openstack-dev] [tripleo] CI jobs failures

2016-03-08 Thread Richard Su



On 03/08/2016 09:58 AM, Derek Higgins wrote:

On 7 March 2016 at 18:22, Ben Nemec  wrote:

On 03/07/2016 11:33 AM, Derek Higgins wrote:

On 7 March 2016 at 15:24, Derek Higgins  wrote:

On 6 March 2016 at 16:58, James Slagle  wrote:

On Sat, Mar 5, 2016 at 11:15 AM, Emilien Macchi  wrote:

I'm kind of hijacking Dan's e-mail but I would like to propose some
technical improvements to stop having so many CI failures.


1/ Stop creating swap files. We don't have SSD, this is IMHO a terrible
mistake to swap on files because we don't have enough RAM. In my
experience, swapping on non-SSD disks is even worse than not having
enough RAM. We should stop doing that I think.

We have been relying on swap in tripleo-ci for a little while. While
not ideal, it has been an effective way to at least be able to test
what we've been testing given the amount of physical RAM that is
available.

Ok, so I have a few points here, in places where I'm making
assumptions I'll try to point it out

o Yes I agree using swap should be avoided if at all possible

o We are currently looking into adding more RAM to our testenv hosts,
at which point we can afford to be a little more liberal with Memory
and this problem should become less of an issue, having said that

o Even though using swap is bad, if we have some processes with a
large Mem footprint that don't require constant access to a portion of
the footprint, swapping it out over the duration of the CI test isn't as
expensive as it would suggest (assuming it doesn't need to be swapped
back in and the kernel has selected good candidates to swap out)

o The test envs that host the undercloud and overcloud nodes have 64G
of RAM each, they each host 4 testenvs and each test env if running a
HA job can use up to 21G of RAM, so we have overcommitted there. This
is only a problem if a test env host gets 4 HA jobs that are
started around the same time (and as a result each has 4 overcloud
nodes running at the same time), to allow this to happen without VM's
being killed by the OOM we've also enabled swap there. The majority of
the time this swap isn't in use, only if all 4 testenvs are being
simultaneously used and they are all running the second half of a CI
test at the same time.

o The overcloud nodes are VM's running with a "unsafe" disk caching
mechanism, this causes sync requests from guest to be ignored and as a
result if the instances being hosted on these nodes are going into
swap this swap will be cached on the host as long as RAM is available.
i.e. swap being used in the undercloud or overcloud isn't being synced
to the disk on the host unless it has to be.

o What I'd like us to avoid is simply bumping up the memory every time
we hit an OOM error without at least
   1. Explaining why we need more memory all of a sudden
   2. Looking into a way we may be able to avoid simply bumping the RAM
(at peak times we are memory constrained)

As an example, let's take a look at the swap usage on the undercloud of
a recent CI nonha job[1][2]. These instances have 5G of RAM with 2G of
swap enabled via a swapfile.
the overcloud deploy started @22:07:46 and finished at @22:28:06

In the graph you'll see a spike in memory being swapped out around
22:09; this corresponds almost exactly to when the overcloud image is
being downloaded from swift[3]. Looking at the top output at the end of
the test, you'll see that swift-proxy is using over 500M of Mem[4].

I'd much prefer we spend time looking into why the swift proxy is
using this much memory rather than blindly bumping the memory allocated
to the VM; perhaps we have something configured incorrectly or we've
hit a bug in swift.

Having said all that, we can bump the memory allocated to each node, but
we have to accept 1 of 2 possible consequences:
1. We'll end up using the swap on the testenv hosts more than we
currently are, or
2. We'll have to reduce the number of test envs per host from 4 down
to 3, wiping out 25% of our capacity.

Thinking about this a little more, we could do a radical experiment
for a week and just do this, i.e. bump up the RAM on each env and
accept that we lose 25% of our capacity; maybe it doesn't matter, since if
our success rate goes up then we'd be running fewer rechecks anyway.
The downside is that we'd probably hit fewer timing errors (assuming
the tight resources are what's exposing them); I say downside because
this just means downstream users might hit them more often if CI
isn't. Anyway, maybe worth discussing at tomorrow's meeting.

+1 to reducing the number of testenvs and allocating more memory to
each.  The huge number of rechecks we're having to do is definitely
contributing to our CI load in a big way, so if we could cut those down
by 50% I bet it would offset the lost testenvs.  And it would reduce
developer aggravation by about a million percent. :-)

Also, on some level I'm not too concerned about the absolute minimum
memory use case.  Nobody 

Re: [openstack-dev] [tripleo] CI jobs failures

2016-03-08 Thread James Slagle
On Tue, Mar 8, 2016 at 12:58 PM, Derek Higgins  wrote:
> We discussed this at today's meeting but never really came to a
> conclusion except to say most people wanted to try it. The main
> objection brought up was that we shouldn't go dropping the nonha job,
> that isn't what I was proposing so let me rephrase here and see if we
> can gather +/-1's
>
> I'm proposing we redeploy our testenvs with more RAM allocated per
> env, specifically we would go from
> 5G undercloud and 4G overcloud nodes to
> 6G undercloud and 5G overcloud nodes

+1

>
> In addition, to accommodate this we would reduce the number of envs
> available from 48 (the actual number varies from time to time) to 36
> (3 envs per host).
>
> No changes would be happening on the jobs we actually run

+1

>
> The assumption is that with the increased resources we would hit fewer
> false negative test results and as a result recheck jobs less (so the
> 25% reduction in capacity wouldn't hit us as hard as it might seem).
> We also may not be able to easily undo this if it doesn't work out, as
> once we start merging things that use the extra RAM it will be hard to
> go back.

The CPU load is also very high. When I have been looking at jobs that
appear stuck, it takes almost 3 minutes just to do a nova list
sometimes. So I think one less testenv on each host will help with that as
well.

-- 
-- James Slagle
--

__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [tripleo] CI jobs failures

2016-03-08 Thread Ben Nemec
On 03/08/2016 11:58 AM, Derek Higgins wrote:
> On 7 March 2016 at 18:22, Ben Nemec  wrote:
>> On 03/07/2016 11:33 AM, Derek Higgins wrote:
>>> On 7 March 2016 at 15:24, Derek Higgins  wrote:
 On 6 March 2016 at 16:58, James Slagle  wrote:
> On Sat, Mar 5, 2016 at 11:15 AM, Emilien Macchi  
> wrote:
>> I'm kind of hijacking Dan's e-mail but I would like to propose some
>> technical improvements to stop having so much CI failures.
>>
>>
>> 1/ Stop creating swap files. We don't have SSD, this is IMHO a terrible
>> mistake to swap on files because we don't have enough RAM. In my
>> experience, swapping on non-SSD disks is even worse than not having
>> enough RAM. We should stop doing that I think.
>
> We have been relying on swap in tripleo-ci for a little while. While
> not ideal, it has been an effective way to at least be able to test
> what we've been testing given the amount of physical RAM that is
> available.

 Ok, so I have a few points here, in places where I'm making
 assumptions I'll try to point it out

 o Yes I agree using swap should be avoided if at all possible

 o We are currently looking into adding more RAM to our testenv hosts,
 at which point we can afford to be a little more liberal with Memory
 and this problem should become less of an issue, having said that

 o Even though using swap is bad, if we have some processes with a
 large Mem footprint that don't require constant access to a portion of
 the footprint swaping it out over the duration of the CI test isn't as
 expensive as it would suggest (assuming it doesn't need to be swapped
 back in and the kernel has selected good candidates to swap out)

 o The test envs that host the undercloud and overcloud nodes have 64G
 of RAM each, they each host 4 testenvs and each test env if running a
 HA job can use up to 21G of RAM so we have over committed there, it
 this is only a problem if a test env host gets 4 HA jobs that are
 started around the same time (and as a result a each have 4 overcloud
 nodes running at the same time), to allow this to happen without VM's
 being killed by the OOM we've also enabled swap there. The majority of
 the time this swap isn't in use, only if all 4 testenvs are being
 simultaneously used and they are all running the second half of a CI
 test at the same time.

 o The overcloud nodes are VM's running with a "unsafe" disk caching
 mechanism, this causes sync requests from guest to be ignored and as a
 result if the instances being hosted on these nodes are going into
 swap this swap will be cached on the host as long as RAM is available.
 i.e. swap being used in the undercloud or overcloud isn't being synced
 to the disk on the host unless it has to be.

 o What I'd like us to avoid is simply bumping up the memory every time
 we hit a OOM error without at least
   1. Explaining why we need more memory all of a sudden
   2. Looking into a way we may be able to avoid simply bumping the RAM
 (at peak times we are memory constrained)

 as an example, Lets take a look at the swap usage on the undercloud of
 a recent ci nonha job[1][2]. These instances have 5G of RAM with 2G of
 swap enabled via a swapfile
 the overcloud deploy started @22:07:46 and finished at @22:28:06

 In the graph you'll see a spike in memory being swapped out around
 22:09, this corresponds almost exactly to when the overcloud image is
 being downloaded from swift[3], looking the top output at the end of
 the test you'll see that swift-proxy is using over 500M of Mem[4].

 I'd much prefer we spend time looking into why the swift proxy is
 using this much memory rather then blindly bump the memory allocated
 to the VM, perhaps we have something configured incorrectly or we've
 hit a bug in swift.

 Having said all that we can bump the memory allocated to each node but
 we have to accept 1 of 2 possible consequences
 1. We'll end up using the swap on the testenv hosts more than we
 currently are or
 2. We'll have to reduce the number of test envs per host from 4 down
 to 3, wiping 25% of our capacity
>>>
>>> Thinking about this a little more, we could do a radical experiment
>>> for a week and just do this, i.e. bump up the RAM on each env and
>>> accept we lose 25% of our capacity, maybe it doesn't matter, if our
>>> success rate goes up then we'd be running less rechecks anyways.
>>> The downside is that we'd probably hit less timing errors (assuming
>>> the tight resources is whats showing them up), I say downside because
>>> this just means downstream users might hit them more often if CI
>>> isn't. Anyways maybe worth discussing at tomorrows meeting.

Re: [openstack-dev] [tripleo] CI jobs failures

2016-03-08 Thread Derek Higgins
On 7 March 2016 at 18:22, Ben Nemec  wrote:
> On 03/07/2016 11:33 AM, Derek Higgins wrote:
>> On 7 March 2016 at 15:24, Derek Higgins  wrote:
>>> On 6 March 2016 at 16:58, James Slagle  wrote:
 On Sat, Mar 5, 2016 at 11:15 AM, Emilien Macchi  wrote:
> I'm kind of hijacking Dan's e-mail but I would like to propose some
> technical improvements to stop having so much CI failures.
>
>
> 1/ Stop creating swap files. We don't have SSD, this is IMHO a terrible
> mistake to swap on files because we don't have enough RAM. In my
> experience, swapping on non-SSD disks is even worse than not having
> enough RAM. We should stop doing that I think.

 We have been relying on swap in tripleo-ci for a little while. While
 not ideal, it has been an effective way to at least be able to test
 what we've been testing given the amount of physical RAM that is
 available.
>>>
>>> Ok, so I have a few points here, in places where I'm making
>>> assumptions I'll try to point it out
>>>
>>> o Yes I agree using swap should be avoided if at all possible
>>>
>>> o We are currently looking into adding more RAM to our testenv hosts,
>>> at which point we can afford to be a little more liberal with Memory
>>> and this problem should become less of an issue, having said that
>>>
>>> o Even though using swap is bad, if we have some processes with a
>>> large Mem footprint that don't require constant access to a portion of
>>> the footprint swaping it out over the duration of the CI test isn't as
>>> expensive as it would suggest (assuming it doesn't need to be swapped
>>> back in and the kernel has selected good candidates to swap out)
>>>
>>> o The test envs that host the undercloud and overcloud nodes have 64G
>>> of RAM each, they each host 4 testenvs and each test env if running a
>>> HA job can use up to 21G of RAM so we have over committed there, it
>>> this is only a problem if a test env host gets 4 HA jobs that are
>>> started around the same time (and as a result a each have 4 overcloud
>>> nodes running at the same time), to allow this to happen without VM's
>>> being killed by the OOM we've also enabled swap there. The majority of
>>> the time this swap isn't in use, only if all 4 testenvs are being
>>> simultaneously used and they are all running the second half of a CI
>>> test at the same time.
>>>
>>> o The overcloud nodes are VM's running with a "unsafe" disk caching
>>> mechanism, this causes sync requests from guest to be ignored and as a
>>> result if the instances being hosted on these nodes are going into
>>> swap this swap will be cached on the host as long as RAM is available.
>>> i.e. swap being used in the undercloud or overcloud isn't being synced
>>> to the disk on the host unless it has to be.
>>>
>>> o What I'd like us to avoid is simply bumping up the memory every time
>>> we hit a OOM error without at least
>>>   1. Explaining why we need more memory all of a sudden
>>>   2. Looking into a way we may be able to avoid simply bumping the RAM
>>> (at peak times we are memory constrained)
>>>
>>> as an example, Lets take a look at the swap usage on the undercloud of
>>> a recent ci nonha job[1][2]. These instances have 5G of RAM with 2G of
>>> swap enabled via a swapfile
>>> the overcloud deploy started @22:07:46 and finished at @22:28:06
>>>
>>> In the graph you'll see a spike in memory being swapped out around
>>> 22:09, this corresponds almost exactly to when the overcloud image is
>>> being downloaded from swift[3], looking the top output at the end of
>>> the test you'll see that swift-proxy is using over 500M of Mem[4].
>>>
>>> I'd much prefer we spend time looking into why the swift proxy is
>>> using this much memory rather then blindly bump the memory allocated
>>> to the VM, perhaps we have something configured incorrectly or we've
>>> hit a bug in swift.
>>>
>>> Having said all that we can bump the memory allocated to each node but
>>> we have to accept 1 of 2 possible consequences
>>> 1. We'll end up using the swap on the testenv hosts more than we
>>> currently are or
>>> 2. We'll have to reduce the number of test envs per host from 4 down
>>> to 3, wiping 25% of our capacity
>>
>> Thinking about this a little more, we could do a radical experiment
>> for a week and just do this, i.e. bump up the RAM on each env and
>> accept we lose 25% of our capacity, maybe it doesn't matter, if our
>> success rate goes up then we'd be running less rechecks anyways.
>> The downside is that we'd probably hit less timing errors (assuming
>> the tight resources is whats showing them up), I say downside because
>> this just means downstream users might hit them more often if CI
>> isn't. Anyways maybe worth discussing at tomorrows meeting.
>
> +1 to reducing the number of testenvs and allocating more memory to
> each.  The huge number of rechecks we're having to do is 

Re: [openstack-dev] [tripleo] CI jobs failures

2016-03-07 Thread Ben Nemec
On 03/07/2016 11:33 AM, Derek Higgins wrote:
> On 7 March 2016 at 15:24, Derek Higgins  wrote:
>> On 6 March 2016 at 16:58, James Slagle  wrote:
>>> On Sat, Mar 5, 2016 at 11:15 AM, Emilien Macchi  wrote:
 I'm kind of hijacking Dan's e-mail but I would like to propose some
 technical improvements to stop having so much CI failures.


 1/ Stop creating swap files. We don't have SSD, this is IMHO a terrible
 mistake to swap on files because we don't have enough RAM. In my
 experience, swapping on non-SSD disks is even worse than not having
 enough RAM. We should stop doing that I think.
>>>
>>> We have been relying on swap in tripleo-ci for a little while. While
>>> not ideal, it has been an effective way to at least be able to test
>>> what we've been testing given the amount of physical RAM that is
>>> available.
>>
>> Ok, so I have a few points here, in places where I'm making
>> assumptions I'll try to point it out
>>
>> o Yes I agree using swap should be avoided if at all possible
>>
>> o We are currently looking into adding more RAM to our testenv hosts,
>> at which point we can afford to be a little more liberal with Memory
>> and this problem should become less of an issue, having said that
>>
>> o Even though using swap is bad, if we have some processes with a
>> large Mem footprint that don't require constant access to a portion of
>> the footprint swaping it out over the duration of the CI test isn't as
>> expensive as it would suggest (assuming it doesn't need to be swapped
>> back in and the kernel has selected good candidates to swap out)
>>
>> o The test envs that host the undercloud and overcloud nodes have 64G
>> of RAM each, they each host 4 testenvs and each test env if running a
>> HA job can use up to 21G of RAM so we have over committed there, it
>> this is only a problem if a test env host gets 4 HA jobs that are
>> started around the same time (and as a result a each have 4 overcloud
>> nodes running at the same time), to allow this to happen without VM's
>> being killed by the OOM we've also enabled swap there. The majority of
>> the time this swap isn't in use, only if all 4 testenvs are being
>> simultaneously used and they are all running the second half of a CI
>> test at the same time.
>>
>> o The overcloud nodes are VM's running with a "unsafe" disk caching
>> mechanism, this causes sync requests from guest to be ignored and as a
>> result if the instances being hosted on these nodes are going into
>> swap this swap will be cached on the host as long as RAM is available.
>> i.e. swap being used in the undercloud or overcloud isn't being synced
>> to the disk on the host unless it has to be.
>>
>> o What I'd like us to avoid is simply bumping up the memory every time
>> we hit a OOM error without at least
>>   1. Explaining why we need more memory all of a sudden
>>   2. Looking into a way we may be able to avoid simply bumping the RAM
>> (at peak times we are memory constrained)
>>
>> as an example, Lets take a look at the swap usage on the undercloud of
>> a recent ci nonha job[1][2]. These instances have 5G of RAM with 2G of
>> swap enabled via a swapfile
>> the overcloud deploy started @22:07:46 and finished at @22:28:06
>>
>> In the graph you'll see a spike in memory being swapped out around
>> 22:09, this corresponds almost exactly to when the overcloud image is
>> being downloaded from swift[3], looking the top output at the end of
>> the test you'll see that swift-proxy is using over 500M of Mem[4].
>>
>> I'd much prefer we spend time looking into why the swift proxy is
>> using this much memory rather then blindly bump the memory allocated
>> to the VM, perhaps we have something configured incorrectly or we've
>> hit a bug in swift.
>>
>> Having said all that we can bump the memory allocated to each node but
>> we have to accept 1 of 2 possible consequences
>> 1. We'll end up using the swap on the testenv hosts more than we
>> currently are or
>> 2. We'll have to reduce the number of test envs per host from 4 down
>> to 3, wiping 25% of our capacity
> 
> Thinking about this a little more, we could do a radical experiment
> for a week and just do this, i.e. bump up the RAM on each env and
> accept we loose 25 of our capacity, maybe it doesn't matter, if our
> success rate goes up then we'd be running less rechecks anyways.
> The downside is that we'd probably hit less timing errors (assuming
> the tight resources is whats showing them up), I say downside because
> this just means downstream users might hit them more often if CI
> isn't. Anyways maybe worth discussing at tomorrows meeting.

+1 to reducing the number of testenvs and allocating more memory to
each.  The huge number of rechecks we're having to do is definitely
contributing to our CI load in a big way, so if we could cut those down
by 50% I bet it would offset the lost testenvs.  And it would reduce
developer 

Re: [openstack-dev] [tripleo] CI jobs failures

2016-03-07 Thread Ben Nemec
On 03/07/2016 12:00 PM, Derek Higgins wrote:
> On 7 March 2016 at 12:11, John Trowbridge  wrote:
>>
>>
>> On 03/06/2016 11:58 AM, James Slagle wrote:
>>> On Sat, Mar 5, 2016 at 11:15 AM, Emilien Macchi  wrote:
 I'm kind of hijacking Dan's e-mail but I would like to propose some
 technical improvements to stop having so many CI failures.


 1/ Stop creating swap files. We don't have SSD, this is IMHO a terrible
 mistake to swap on files because we don't have enough RAM. In my
 experience, swapping on non-SSD disks is even worse than not having
 enough RAM. We should stop doing that I think.
>>>
>>> We have been relying on swap in tripleo-ci for a little while. While
>>> not ideal, it has been an effective way to at least be able to test
>>> what we've been testing given the amount of physical RAM that is
>>> available.
>>>
>>> The recent change to add swap to the overcloud nodes has proved to be
>>> unstable. But that has more to do with it being racey with the
>>> validation deployment afaict. There are some patches currently up to
>>> address those issues.
>>>


 2/ Split CI jobs in scenarios.

 Currently we have CI jobs for ceph, HA, non-ha, and containers, and the
 current situation is that jobs fail randomly due to performance issues.

 Puppet OpenStack CI had the same issue where we had one integration job
 and we never stopped adding more services until it all became *very*
 unstable. We solved that issue by splitting the jobs and creating 
 scenarios:

 https://github.com/openstack/puppet-openstack-integration#description

 What I propose is to split TripleO jobs in more jobs, but with less
 services.

 The benefit of that:

 * more services coverage
 * jobs will run faster
 * less random issues due to bad performances

 The cost is of course it will consume more resources.
 That's why I suggest 3/.

 We could have:

 * HA job with ceph and a full compute scenario (glance, nova, cinder,
 ceilometer, aodh & gnocchi).
 * Same with IPv6 & SSL.
 * HA job without ceph and full compute scenario too
 * HA job without ceph and basic compute (glance and nova), with extra
 services like Trove, Sahara, etc.
 * ...
 (note: all jobs would have network isolation, which is to me a
 requirement when testing an installer like TripleO).
>>>
>>> Each of those jobs would at least require as much memory as our
>>> current HA job. I don't see how this gets us to using less memory. The
>>> HA job we have now already deploys the minimal amount of services that
>>> is possible given our current architecture. Without the composable
>>> service roles work, we can't deploy less services than we already are.
>>>
>>>
>>>

 3/ Drop non-ha job.
 I'm not sure why we have it, and the benefit of testing that comparing
 to HA.
>>>
>>> In my opinion, I actually think that we could drop the ceph and non-ha
>>> job from the check-tripleo queue.
>>>
>>> non-ha doesn't test anything realistic, and it doesn't really provide
>>> any faster feedback on patches. It seems at most it might run 15-20
>>> minutes faster than the HA job on average. Sometimes it even runs
>>> slower than the HA job.
>>>
>>> The ceph job we could move to the experimental queue to run on demand
>>> on patches that might affect ceph, and it could also be a daily
>>> periodic job.
>>>
>>> The same could be done for the containers job, an IPv6 job, and an
>>> upgrades job. Ideally with a way to run an individual job as needed.
>>> Would we need different experimental queues to do that?
>>>
>>> That would leave only the HA job in the check queue, which we should
>>> run with SSL and network isolation. We could deploy less testenv's
>>> since we'd have less jobs running, but give the ones we do deploy more
>>> RAM. I think this would really alleviate a lot of the transient
>>> intermittent failures we get in CI currently. It would also likely run
>>> faster.
>>>
>>> It's probably worth seeking out some exact evidence from the RDO
>>> centos-ci, because I think they are testing with virtual environments
>>> that have a lot more RAM than tripleo-ci does. It'd be good to
>>> understand if they have some of the transient failures that tripleo-ci
>>> does as well.
>>>
>>
>> The HA job in RDO CI is also more unstable than nonHA, although this is
>> usually not to do with memory contention. Most of the time that I see
>> the HA job fail spuriously in RDO CI, it is because of the Nova
>> scheduler race. I would bet that this race is the cause for the
>> fluctuating amount of time jobs take as well, because the recovery
>> mechanism for this is just to retry. Those retries can add 15 min. per
>> retry to the deploy. In RDO CI there is a 60min. timeout for deploy as
>> well. If we can't deploy to virtual machines in under an hour, to me
>> that is a bug. 

Re: [openstack-dev] [tripleo] CI jobs failures

2016-03-07 Thread Derek Higgins
On 7 March 2016 at 12:11, John Trowbridge  wrote:
>
>
> On 03/06/2016 11:58 AM, James Slagle wrote:
>> On Sat, Mar 5, 2016 at 11:15 AM, Emilien Macchi  wrote:
>>> I'm kind of hijacking Dan's e-mail but I would like to propose some
>>> technical improvements to stop having so much CI failures.
>>>
>>>
>>> 1/ Stop creating swap files. We don't have SSD, this is IMHO a terrible
>>> mistake to swap on files because we don't have enough RAM. In my
>>> experience, swaping on non-SSD disks is even worst that not having
>>> enough RAM. We should stop doing that I think.
>>
>> We have been relying on swap in tripleo-ci for a little while. While
>> not ideal, it has been an effective way to at least be able to test
>> what we've been testing given the amount of physical RAM that is
>> available.
>>
>> The recent change to add swap to the overcloud nodes has proved to be
>> unstable. But that has more to do with it being racey with the
>> validation deployment afaict. There are some patches currently up to
>> address those issues.
>>
>>>
>>>
>>> 2/ Split CI jobs in scenarios.
>>>
>>> Currently we have CI jobs for ceph, HA, non-ha, containers and the
>>> current situation is that jobs fail randomly, due to performances issues.
>>>
>>> Puppet OpenStack CI had the same issue where we had one integration job
>>> and we never stopped adding more services until all becomes *very*
>>> unstable. We solved that issue by splitting the jobs and creating scenarios:
>>>
>>> https://github.com/openstack/puppet-openstack-integration#description
>>>
>>> What I propose is to split TripleO jobs in more jobs, but with less
>>> services.
>>>
>>> The benefit of that:
>>>
>>> * more services coverage
>>> * jobs will run faster
>>> * less random issues due to bad performances
>>>
>>> The cost is of course it will consume more resources.
>>> That's why I suggest 3/.
>>>
>>> We could have:
>>>
>>> * HA job with ceph and a full compute scenario (glance, nova, cinder,
>>> ceilometer, aodh & gnocchi).
>>> * Same with IPv6 & SSL.
>>> * HA job without ceph and full compute scenario too
>>> * HA job without ceph and basic compute (glance and nova), with extra
>>> services like Trove, Sahara, etc.
>>> * ...
>>> (note: all jobs would have network isolation, which is to me a
>>> requirement when testing an installer like TripleO).
>>
>> Each of those jobs would at least require as much memory as our
>> current HA job. I don't see how this gets us to using less memory. The
>> HA job we have now already deploys the minimal amount of services that
>> is possible given our current architecture. Without the composable
>> service roles work, we can't deploy less services than we already are.
>>
>>
>>
>>>
>>> 3/ Drop non-ha job.
>>> I'm not sure why we have it, and the benefit of testing that comparing
>>> to HA.
>>
>> In my opinion, I actually think that we could drop the ceph and non-ha
>> job from the check-tripleo queue.
>>
>> non-ha doesn't test anything realistic, and it doesn't really provide
>> any faster feedback on patches. It seems at most it might run 15-20
>> minutes faster than the HA job on average. Sometimes it even runs
>> slower than the HA job.
>>
>> The ceph job we could move to the experimental queue to run on demand
>> on patches that might affect ceph, and it could also be a daily
>> periodic job.
>>
>> The same could be done for the containers job, an IPv6 job, and an
>> upgrades job. Ideally with a way to run an individual job as needed.
>> Would we need different experimental queues to do that?
>>
>> That would leave only the HA job in the check queue, which we should
>> run with SSL and network isolation. We could deploy less testenv's
>> since we'd have less jobs running, but give the ones we do deploy more
>> RAM. I think this would really alleviate a lot of the transient
>> intermittent failures we get in CI currently. It would also likely run
>> faster.
>>
>> It's probably worth seeking out some exact evidence from the RDO
>> centos-ci, because I think they are testing with virtual environments
>> that have a lot more RAM than tripleo-ci does. It'd be good to
>> understand if they have some of the transient failures that tripleo-ci
>> does as well.
>>
>
> The HA job in RDO CI is also more unstable than nonHA, although this is
> usually not to do with memory contention. Most of the time that I see
> the HA job fail spuriously in RDO CI, it is because of the Nova
> scheduler race. I would bet that this race is the cause for the
> fluctuating amount of time jobs take as well, because the recovery
> mechanism for this is just to retry. Those retries can add 15 min. per
> retry to the deploy. In RDO CI there is a 60min. timeout for deploy as
> well. If we can't deploy to virtual machines in under an hour, to me
> that is a bug. (Note, I am speaking of `openstack overcloud deploy` when
> I say deploy, though start to finish can take less than an hour with
> decent CPUs)
>
> RDO CI uses the 

Re: [openstack-dev] [tripleo] CI jobs failures

2016-03-07 Thread Derek Higgins
On 7 March 2016 at 15:24, Derek Higgins  wrote:
> On 6 March 2016 at 16:58, James Slagle  wrote:
>> On Sat, Mar 5, 2016 at 11:15 AM, Emilien Macchi  wrote:
>>> I'm kind of hijacking Dan's e-mail but I would like to propose some
>>> technical improvements to stop having so much CI failures.
>>>
>>>
>>> 1/ Stop creating swap files. We don't have SSD, this is IMHO a terrible
>>> mistake to swap on files because we don't have enough RAM. In my
>>> experience, swaping on non-SSD disks is even worst that not having
>>> enough RAM. We should stop doing that I think.
>>
>> We have been relying on swap in tripleo-ci for a little while. While
>> not ideal, it has been an effective way to at least be able to test
>> what we've been testing given the amount of physical RAM that is
>> available.
>
> Ok, so I have a few points here, in places where I'm making
> assumptions I'll try to point it out
>
> o Yes I agree using swap should be avoided if at all possible
>
> o We are currently looking into adding more RAM to our testenv hosts,
> at which point we can afford to be a little more liberal with memory
> and this problem should become less of an issue, having said that
>
> o Even though using swap is bad, if we have some processes with a
> large Mem footprint that don't require constant access to a portion of
> the footprint, swapping it out over the duration of the CI test isn't as
> expensive as it would suggest (assuming it doesn't need to be swapped
> back in and the kernel has selected good candidates to swap out)
>
> o The test envs that host the undercloud and overcloud nodes have 64G
> of RAM each, they each host 4 testenvs and each test env if running a
> HA job can use up to 21G of RAM so we have over committed there, it
> this is only a problem if a test env host gets 4 HA jobs that are
> started around the same time (and as a result a each have 4 overcloud
> nodes running at the same time), to allow this to happen without VM's
> being killed by the OOM we've also enabled swap there. The majority of
> the time this swap isn't in use, only if all 4 testenvs are being
> simultaneously used and they are all running the second half of a CI
> test at the same time.
>
> o The overcloud nodes are VM's running with an "unsafe" disk caching
> mechanism, this causes sync requests from the guest to be ignored and as a
> result if the instances being hosted on these nodes are going into
> swap this swap will be cached on the host as long as RAM is available.
> i.e. swap being used in the undercloud or overcloud isn't being synced
> to the disk on the host unless it has to be.
>
> o What I'd like us to avoid is simply bumping up the memory every time
> we hit an OOM error without at least
>   1. Explaining why we need more memory all of a sudden
>   2. Looking into a way we may be able to avoid simply bumping the RAM
> (at peak times we are memory constrained)
>
> as an example, let's take a look at the swap usage on the undercloud of
> a recent ci nonha job[1][2]. These instances have 5G of RAM with 2G of
> swap enabled via a swapfile
> the overcloud deploy started @22:07:46 and finished at @22:28:06
>
> In the graph you'll see a spike in memory being swapped out around
> 22:09, this corresponds almost exactly to when the overcloud image is
> being downloaded from swift[3], looking at the top output at the end of
> the test you'll see that swift-proxy is using over 500M of Mem[4].
>
> I'd much prefer we spend time looking into why the swift proxy is
> using this much memory rather than blindly bumping the memory allocated
> to the VM, perhaps we have something configured incorrectly or we've
> hit a bug in swift.
>
> Having said all that we can bump the memory allocated to each node but
> we have to accept 1 of 2 possible consequences
> 1. We'll end up using the swap on the testenv hosts more than we
> currently are, or
> 2. We'll have to reduce the number of test envs per host from 4 down
> to 3, wiping out 25% of our capacity

Thinking about this a little more, we could do a radical experiment
for a week and just do this, i.e. bump up the RAM on each env and
accept we lose 25% of our capacity, maybe it doesn't matter, if our
success rate goes up then we'd be running fewer rechecks anyway.
The downside is that we'd probably hit fewer timing errors (assuming
the tight resources are what's showing them up), I say downside because
this just means downstream users might hit them more often if CI
isn't. Anyway, maybe worth discussing at tomorrow's meeting.


>
> [1] - 
> http://logs.openstack.org/85/289085/2/check-tripleo/gate-tripleo-ci-f22-nonha/6fda33c/
> [2] - http://goodsquishy.com/downloads/20160307/swap.png
> [3] - 22:09:03 21678 INFO [-] Master cache miss for image
> b6a96213-7955-4c4d-829e-871350939e03, starting download
>   22:09:41 21678 DEBUG [-] Running cmd (subprocess): qemu-img info
> /var/lib/ironic/master_images/tmpvjAlCU/b6a96213-7955-4c4d-829e-871350939e03.part

Re: [openstack-dev] [tripleo] CI jobs failures

2016-03-07 Thread Derek Higgins
On 6 March 2016 at 16:58, James Slagle  wrote:
> On Sat, Mar 5, 2016 at 11:15 AM, Emilien Macchi  wrote:
>> I'm kind of hijacking Dan's e-mail but I would like to propose some
>> technical improvements to stop having so much CI failures.
>>
>>
>> 1/ Stop creating swap files. We don't have SSD, this is IMHO a terrible
>> mistake to swap on files because we don't have enough RAM. In my
>> experience, swaping on non-SSD disks is even worst that not having
>> enough RAM. We should stop doing that I think.
>
> We have been relying on swap in tripleo-ci for a little while. While
> not ideal, it has been an effective way to at least be able to test
> what we've been testing given the amount of physical RAM that is
> available.

Ok, so I have a few points here, in places where I'm making
assumptions I'll try to point it out

o Yes I agree using swap should be avoided if at all possible

o We are currently looking into adding more RAM to our testenv hosts,
at which point we can afford to be a little more liberal with memory
and this problem should become less of an issue, having said that

o Even though using swap is bad, if we have some processes with a
large Mem footprint that don't require constant access to a portion of
the footprint, swapping it out over the duration of the CI test isn't as
expensive as it would suggest (assuming it doesn't need to be swapped
back in and the kernel has selected good candidates to swap out)

o The test envs that host the undercloud and overcloud nodes have 64G
of RAM each, they each host 4 testenvs and each test env if running a
HA job can use up to 21G of RAM, so we have overcommitted there; this
is only a problem if a test env host gets 4 HA jobs that are
started around the same time (and as a result each has 4 overcloud
nodes running at the same time). To allow this to happen without VMs
being killed by the OOM we've also enabled swap there. The majority of
the time this swap isn't in use, only if all 4 testenvs are being
simultaneously used and they are all running the second half of a CI
test at the same time.

o The overcloud nodes are VM's running with an "unsafe" disk caching
mechanism, this causes sync requests from the guest to be ignored and as a
result if the instances being hosted on these nodes are going into
swap this swap will be cached on the host as long as RAM is available.
i.e. swap being used in the undercloud or overcloud isn't being synced
to the disk on the host unless it has to be.

o What I'd like us to avoid is simply bumping up the memory every time
we hit an OOM error without at least
  1. Explaining why we need more memory all of a sudden
  2. Looking into a way we may be able to avoid simply bumping the RAM
(at peak times we are memory constrained)

as an example, let's take a look at the swap usage on the undercloud of
a recent ci nonha job[1][2]. These instances have 5G of RAM with 2G of
swap enabled via a swapfile
the overcloud deploy started @22:07:46 and finished at @22:28:06

In the graph you'll see a spike in memory being swapped out around
22:09, this corresponds almost exactly to when the overcloud image is
being downloaded from swift[3], looking at the top output at the end of
the test you'll see that swift-proxy is using over 500M of Mem[4].

I'd much prefer we spend time looking into why the swift proxy is
using this much memory rather than blindly bumping the memory allocated
to the VM, perhaps we have something configured incorrectly or we've
hit a bug in swift.
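
For anyone digging into that, per-process memory on the undercloud is easy to
pull with plain ps (nothing tripleo-specific here, just a sketch):

  # top resident-memory consumers on the undercloud, RSS in KiB
  ps aux --sort=-rss | head -n 15
  # or just the swift proxy processes
  ps -C swift-proxy-server -o pid,rss,vsz,cmd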

Having said all that we can bump the memory allocated to each node but
we have to accept 1 of 2 possible consequences
1. We'll end up using the swap on the testenv hosts more than we
currently are, or
2. We'll have to reduce the number of test envs per host from 4 down
to 3, wiping out 25% of our capacity

[1] - 
http://logs.openstack.org/85/289085/2/check-tripleo/gate-tripleo-ci-f22-nonha/6fda33c/
[2] - http://goodsquishy.com/downloads/20160307/swap.png
[3] - 22:09:03 21678 INFO [-] Master cache miss for image
b6a96213-7955-4c4d-829e-871350939e03, starting download
  22:09:41 21678 DEBUG [-] Running cmd (subprocess): qemu-img info
/var/lib/ironic/master_images/tmpvjAlCU/b6a96213-7955-4c4d-829e-871350939e03.part
[4] - 17690 swift 20   0  804824 547724   1780 S   0.0 10.8
0:04.82 swift-prox+
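
Incidentally, re the "unsafe" disk caching mentioned above, the cache mode the
testenv VMs are actually using is visible in the libvirt domain XML. A rough
sketch (the domain name here is made up, use whatever virsh list shows):

  virsh list --all
  virsh dumpxml baremetalbrbm_0 | grep cache
  # expect something like: <driver name='qemu' type='qcow2' cache='unsafe'/>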


>
> The recent change to add swap to the overcloud nodes has proved to be
> unstable. But that has more to do with it being racey with the
> validation deployment afaict. There are some patches currently up to
> address those issues.
>
>>
>>
>> 2/ Split CI jobs in scenarios.
>>
>> Currently we have CI jobs for ceph, HA, non-ha, containers and the
>> current situation is that jobs fail randomly, due to performances issues.

We don't know it's due to performance issues. You're probably correct that
we wouldn't see them if we were allocating more resources to the ci
tests but this just means we have timing issues that are more
prevalent when resource 

Re: [openstack-dev] [tripleo] CI jobs failures

2016-03-07 Thread Dan Prince
On Sat, 2016-03-05 at 11:15 -0500, Emilien Macchi wrote:
> I'm kind of hijacking Dan's e-mail but I would like to propose some
> technical improvements to stop having so much CI failures.
> 
> 
> 1/ Stop creating swap files. We don't have SSD, this is IMHO a
> terrible
> mistake to swap on files because we don't have enough RAM. In my
> experience, swaping on non-SSD disks is even worst that not having
> enough RAM. We should stop doing that I think.
> 
> 
> 2/ Split CI jobs in scenarios.
> 
> Currently we have CI jobs for ceph, HA, non-ha, containers and the
> current situation is that jobs fail randomly, due to performances
> issues.
> 
> Puppet OpenStack CI had the same issue where we had one integration
> job
> and we never stopped adding more services until all becomes *very*
> unstable. We solved that issue by splitting the jobs and creating
> scenarios:
> 
> https://github.com/openstack/puppet-openstack-integration#description
> 
> What I propose is to split TripleO jobs in more jobs, but with less
> services.
> 
> The benefit of that:
> 
> * more services coverage
> * jobs will run faster
> * less random issues due to bad performances
> 
> The cost is of course it will consume more resources.
> That's why I suggest 3/.
> 
> We could have:
> 
> * HA job with ceph and a full compute scenario (glance, nova, cinder,
> ceilometer, aodh & gnocchi).
> * Same with IPv6 & SSL.
> * HA job without ceph and full compute scenario too
> * HA job without ceph and basic compute (glance and nova), with extra
> services like Trove, Sahara, etc.
> * ...
> (note: all jobs would have network isolation, which is to me a
> requirement when testing an installer like TripleO).

I'm not sure we have enough resources to entertain this option. I would
like to see us split the jobs up but not in exactly the way you
describe above. I would rather see us put the effort into architecture
changes like "split stack" which cloud allow us to test the
configuration side of our Heat stack on normal Cloud instances. Once we
have this in place I think we would have more potential resources and
could entertain running more jobs to and thus could split things out to
run in parallel if we choose to do so.

> 
> 3/ Drop non-ha job.
> I'm not sure why we have it, and the benefit of testing that
> comparing
> to HA.

A couple of reasons we have the nonha job I think. First is that not
everyone wants to use HA. We run our own TripleO CI cloud without HA at
this point and I think there is interest in maintaining this as a less
complex installation alternative where HA isn't needed.

Second is need to support functionally testing TripleO where developers
don't have enough resources for 3 controller nodes. At the very least
we'd need a second single node HA job (which wouldn't really be doing
HA) but would allow us to continue supporting the compressed
installation for developer testing, etc.

Dan

> 
> 
> Any comment / feedback is welcome,
> _
> _
> OpenStack Development Mailing List (not for usage questions)
> Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubs
> cribe
> http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev

__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [tripleo] CI jobs failures

2016-03-07 Thread John Trowbridge


On 03/06/2016 11:58 AM, James Slagle wrote:
> On Sat, Mar 5, 2016 at 11:15 AM, Emilien Macchi  wrote:
>> I'm kind of hijacking Dan's e-mail but I would like to propose some
>> technical improvements to stop having so much CI failures.
>>
>>
>> 1/ Stop creating swap files. We don't have SSD, this is IMHO a terrible
>> mistake to swap on files because we don't have enough RAM. In my
>> experience, swaping on non-SSD disks is even worst that not having
>> enough RAM. We should stop doing that I think.
> 
> We have been relying on swap in tripleo-ci for a little while. While
> not ideal, it has been an effective way to at least be able to test
> what we've been testing given the amount of physical RAM that is
> available.
> 
> The recent change to add swap to the overcloud nodes has proved to be
> unstable. But that has more to do with it being racey with the
> validation deployment afaict. There are some patches currently up to
> address those issues.
> 
>>
>>
>> 2/ Split CI jobs in scenarios.
>>
>> Currently we have CI jobs for ceph, HA, non-ha, containers and the
>> current situation is that jobs fail randomly, due to performances issues.
>>
>> Puppet OpenStack CI had the same issue where we had one integration job
>> and we never stopped adding more services until all becomes *very*
>> unstable. We solved that issue by splitting the jobs and creating scenarios:
>>
>> https://github.com/openstack/puppet-openstack-integration#description
>>
>> What I propose is to split TripleO jobs in more jobs, but with less
>> services.
>>
>> The benefit of that:
>>
>> * more services coverage
>> * jobs will run faster
>> * less random issues due to bad performances
>>
>> The cost is of course it will consume more resources.
>> That's why I suggest 3/.
>>
>> We could have:
>>
>> * HA job with ceph and a full compute scenario (glance, nova, cinder,
>> ceilometer, aodh & gnocchi).
>> * Same with IPv6 & SSL.
>> * HA job without ceph and full compute scenario too
>> * HA job without ceph and basic compute (glance and nova), with extra
>> services like Trove, Sahara, etc.
>> * ...
>> (note: all jobs would have network isolation, which is to me a
>> requirement when testing an installer like TripleO).
> 
> Each of those jobs would at least require as much memory as our
> current HA job. I don't see how this gets us to using less memory. The
> HA job we have now already deploys the minimal amount of services that
> is possible given our current architecture. Without the composable
> service roles work, we can't deploy less services than we already are.
> 
> 
> 
>>
>> 3/ Drop non-ha job.
>> I'm not sure why we have it, and the benefit of testing that comparing
>> to HA.
> 
> In my opinion, I actually think that we could drop the ceph and non-ha
> job from the check-tripleo queue.
> 
> non-ha doesn't test anything realistic, and it doesn't really provide
> any faster feedback on patches. It seems at most it might run 15-20
> minutes faster than the HA job on average. Sometimes it even runs
> slower than the HA job.
> 
> The ceph job we could move to the experimental queue to run on demand
> on patches that might affect ceph, and it could also be a daily
> periodic job.
> 
> The same could be done for the containers job, an IPv6 job, and an
> upgrades job. Ideally with a way to run an individual job as needed.
> Would we need different experimental queues to do that?
> 
> That would leave only the HA job in the check queue, which we should
> run with SSL and network isolation. We could deploy less testenv's
> since we'd have less jobs running, but give the ones we do deploy more
> RAM. I think this would really alleviate a lot of the transient
> intermittent failures we get in CI currently. It would also likely run
> faster.
> 
> It's probably worth seeking out some exact evidence from the RDO
> centos-ci, because I think they are testing with virtual environments
> that have a lot more RAM than tripleo-ci does. It'd be good to
> understand if they have some of the transient failures that tripleo-ci
> does as well.
> 

The HA job in RDO CI is also more unstable than nonHA, although this is
usually not to do with memory contention. Most of the time that I see
the HA job fail spuriously in RDO CI, it is because of the Nova
scheduler race. I would bet that this race is the cause for the
fluctuating amount of time jobs take as well, because the recovery
mechanism for this is just to retry. Those retries can add 15 min. per
retry to the deploy. In RDO CI there is a 60min. timeout for deploy as
well. If we can't deploy to virtual machines in under an hour, to me
that is a bug. (Note, I am speaking of `openstack overcloud deploy` when
I say deploy, though start to finish can take less than an hour with
decent CPUs)

RDO CI uses the following layout:
Undercloud: 12G RAM, 4 CPUs
3x Control Nodes: 4G RAM, 1 CPU
Compute Node: 4G RAM, 1 CPU

Is there any ability in our current CI setup to auto-identify the cause
of a 

Re: [openstack-dev] [tripleo] CI jobs failures

2016-03-07 Thread Dmitry Tantsur

On 03/06/2016 05:58 PM, James Slagle wrote:

On Sat, Mar 5, 2016 at 11:15 AM, Emilien Macchi  wrote:

I'm kind of hijacking Dan's e-mail but I would like to propose some
technical improvements to stop having so much CI failures.


1/ Stop creating swap files. We don't have SSD, this is IMHO a terrible
mistake to swap on files because we don't have enough RAM. In my
experience, swaping on non-SSD disks is even worst that not having
enough RAM. We should stop doing that I think.


We have been relying on swap in tripleo-ci for a little while. While
not ideal, it has been an effective way to at least be able to test
what we've been testing given the amount of physical RAM that is
available.

The recent change to add swap to the overcloud nodes has proved to be
unstable. But that has more to do with it being racey with the
validation deployment afaict. There are some patches currently up to
address those issues.




2/ Split CI jobs in scenarios.

Currently we have CI jobs for ceph, HA, non-ha, containers and the
current situation is that jobs fail randomly, due to performances issues.

Puppet OpenStack CI had the same issue where we had one integration job
and we never stopped adding more services until all becomes *very*
unstable. We solved that issue by splitting the jobs and creating scenarios:

https://github.com/openstack/puppet-openstack-integration#description

What I propose is to split TripleO jobs in more jobs, but with less
services.

The benefit of that:

* more services coverage
* jobs will run faster
* less random issues due to bad performances

The cost is of course it will consume more resources.
That's why I suggest 3/.

We could have:

* HA job with ceph and a full compute scenario (glance, nova, cinder,
ceilometer, aodh & gnocchi).
* Same with IPv6 & SSL.
* HA job without ceph and full compute scenario too
* HA job without ceph and basic compute (glance and nova), with extra
services like Trove, Sahara, etc.
* ...
(note: all jobs would have network isolation, which is to me a
requirement when testing an installer like TripleO).


Each of those jobs would at least require as much memory as our
current HA job. I don't see how this gets us to using less memory. The
HA job we have now already deploys the minimal amount of services that
is possible given our current architecture. Without the composable
service roles work, we can't deploy less services than we already are.





3/ Drop non-ha job.
I'm not sure why we have it, and the benefit of testing that comparing
to HA.


In my opinion, I actually think that we could drop the ceph and non-ha
job from the check-tripleo queue.

non-ha doesn't test anything realistic, and it doesn't really provide
any faster feedback on patches. It seems at most it might run 15-20
minutes faster than the HA job on average. Sometimes it even runs
slower than the HA job.


The non-HA job is the only job with introspection. So you'll have to 
enable introspection on the HA job, bumping its run time.




The ceph job we could move to the experimental queue to run on demand
on patches that might affect ceph, and it could also be a daily
periodic job.

The same could be done for the containers job, an IPv6 job, and an
upgrades job. Ideally with a way to run an individual job as needed.
Would we need different experimental queues to do that?

That would leave only the HA job in the check queue, which we should
run with SSL and network isolation. We could deploy less testenv's
since we'd have less jobs running, but give the ones we do deploy more
RAM. I think this would really alleviate a lot of the transient
intermittent failures we get in CI currently. It would also likely run
faster.

It's probably worth seeking out some exact evidence from the RDO
centos-ci, because I think they are testing with virtual environments
that have a lot more RAM than tripleo-ci does. It'd be good to
understand if they have some of the transient failures that tripleo-ci
does as well.

We really are deploying on the absolute minimum cpu/ram requirements
that is even possible. I think it's unrealistic to expect a lot of
stability in that scenario. And I think that's a big reason why we get
so many transient failures.

In summary: give the testenv's more ram, have one job in the
check-tripleo queue, as many jobs as needed in the experimental queue,
and as many periodic jobs as necessary.





Any comment / feedback is welcome,
--
Emilien Macchi


__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev








__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe

Re: [openstack-dev] [tripleo] CI jobs failures

2016-03-06 Thread James Slagle
On Sat, Mar 5, 2016 at 11:15 AM, Emilien Macchi  wrote:
> I'm kind of hijacking Dan's e-mail but I would like to propose some
> technical improvements to stop having so much CI failures.
>
>
> 1/ Stop creating swap files. We don't have SSD, this is IMHO a terrible
> mistake to swap on files because we don't have enough RAM. In my
> experience, swaping on non-SSD disks is even worst that not having
> enough RAM. We should stop doing that I think.

We have been relying on swap in tripleo-ci for a little while. While
not ideal, it has been an effective way to at least be able to test
what we've been testing given the amount of physical RAM that is
available.
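
For context, the swap in question is just a plain swapfile created on the
node, roughly the following (size and path are illustrative, not taken
verbatim from the CI scripts):

  sudo dd if=/dev/zero of=/swapfile bs=1M count=2048
  sudo chmod 600 /swapfile
  sudo mkswap /swapfile
  sudo swapon /swapfile
  swapon -s   # confirm it's active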

The recent change to add swap to the overcloud nodes has proved to be
unstable. But that has more to do with it being racey with the
validation deployment afaict. There are some patches currently up to
address those issues.

>
>
> 2/ Split CI jobs in scenarios.
>
> Currently we have CI jobs for ceph, HA, non-ha, containers and the
> current situation is that jobs fail randomly, due to performances issues.
>
> Puppet OpenStack CI had the same issue where we had one integration job
> and we never stopped adding more services until all becomes *very*
> unstable. We solved that issue by splitting the jobs and creating scenarios:
>
> https://github.com/openstack/puppet-openstack-integration#description
>
> What I propose is to split TripleO jobs in more jobs, but with less
> services.
>
> The benefit of that:
>
> * more services coverage
> * jobs will run faster
> * less random issues due to bad performances
>
> The cost is of course it will consume more resources.
> That's why I suggest 3/.
>
> We could have:
>
> * HA job with ceph and a full compute scenario (glance, nova, cinder,
> ceilometer, aodh & gnocchi).
> * Same with IPv6 & SSL.
> * HA job without ceph and full compute scenario too
> * HA job without ceph and basic compute (glance and nova), with extra
> services like Trove, Sahara, etc.
> * ...
> (note: all jobs would have network isolation, which is to me a
> requirement when testing an installer like TripleO).

Each of those jobs would at least require as much memory as our
current HA job. I don't see how this gets us to using less memory. The
HA job we have now already deploys the minimal amount of services that
is possible given our current architecture. Without the composable
service roles work, we can't deploy less services than we already are.



>
> 3/ Drop non-ha job.
> I'm not sure why we have it, and the benefit of testing that comparing
> to HA.

In my opinion, I actually think that we could drop the ceph and non-ha
job from the check-tripleo queue.

non-ha doesn't test anything realistic, and it doesn't really provide
any faster feedback on patches. It seems at most it might run 15-20
minutes faster than the HA job on average. Sometimes it even runs
slower than the HA job.

The ceph job we could move to the experimental queue to run on demand
on patches that might affect ceph, and it could also be a daily
periodic job.

The same could be done for the containers job, an IPv6 job, and an
upgrades job. Ideally with a way to run an individual job as needed.
Would we need different experimental queues to do that?

That would leave only the HA job in the check queue, which we should
run with SSL and network isolation. We could deploy less testenv's
since we'd have less jobs running, but give the ones we do deploy more
RAM. I think this would really alleviate a lot of the transient
intermittent failures we get in CI currently. It would also likely run
faster.

It's probably worth seeking out some exact evidence from the RDO
centos-ci, because I think they are testing with virtual environments
that have a lot more RAM than tripleo-ci does. It'd be good to
understand if they have some of the transient failures that tripleo-ci
does as well.

We really are deploying on the absolute minimum cpu/ram requirements
that is even possible. I think it's unrealistic to expect a lot of
stability in that scenario. And I think that's a big reason why we get
so many transient failures.

In summary: give the testenv's more ram, have one job in the
check-tripleo queue, as many jobs as needed in the experimental queue,
and as many periodic jobs as necessary.


>
>
> Any comment / feedback is welcome,
> --
> Emilien Macchi
>
>
> __
> OpenStack Development Mailing List (not for usage questions)
> Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
> http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
>



-- 
-- James Slagle
--

__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


[openstack-dev] [tripleo] CI jobs failures

2016-03-05 Thread Emilien Macchi
I'm kind of hijacking Dan's e-mail but I would like to propose some
technical improvements to stop having so much CI failures.


1/ Stop creating swap files. We don't have SSD, this is IMHO a terrible
mistake to swap on files because we don't have enough RAM. In my
experience, swapping on non-SSD disks is even worse than not having
enough RAM. We should stop doing that I think.


2/ Split CI jobs in scenarios.

Currently we have CI jobs for ceph, HA, non-ha, containers and the
current situation is that jobs fail randomly, due to performance issues.

Puppet OpenStack CI had the same issue where we had one integration job
and we never stopped adding more services until it all became *very*
unstable. We solved that issue by splitting the jobs and creating scenarios:

https://github.com/openstack/puppet-openstack-integration#description

What I propose is to split TripleO jobs in more jobs, but with less
services.

The benefit of that:

* more services coverage
* jobs will run faster
* fewer random issues due to bad performance

The cost is of course it will consume more resources.
That's why I suggest 3/.

We could have:

* HA job with ceph and a full compute scenario (glance, nova, cinder,
ceilometer, aodh & gnocchi).
* Same with IPv6 & SSL.
* HA job without ceph and full compute scenario too
* HA job without ceph and basic compute (glance and nova), with extra
services like Trove, Sahara, etc.
* ...
(note: all jobs would have network isolation, which is to me a
requirement when testing an installer like TripleO).

3/ Drop non-ha job.
I'm not sure why we have it, or what the benefit of testing it is compared
to HA.


Any comment / feedback is welcome,
-- 
Emilien Macchi



__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


[openstack-dev] [TripleO]: CI outage yesterday/today

2016-02-16 Thread Dan Prince
Just a quick update about the CI outage today and yesterday. Turns out
our jobs weren't running due to a bad Keystone URL (it was pointing to
localhost:5000 instead of our public SSL endpoint).
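
The sort of thing that shows this up is just looking at the catalog from
outside the rack; a sketch, not the exact commands used:

  openstack catalog list
  # or only the identity endpoints:
  openstack endpoint list | grep -i keystone
  # a URL pointing at http://localhost:5000/... is the red flag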

We've now fixed that issue and I'm told that as soon as Infra restarts
nodepool (they cache the keystone endpoints) we should start processing
jobs again.

Wait on it...

http://status.openstack.org/zuul/

Dan

__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] TripleO: CI down... SSL cert expired

2015-04-13 Thread Derek Higgins

On 11/04/15 14:02, Dan Prince wrote:

Looks like our SSL certificate has expired for the currently active CI
cloud. We are working on getting a new one generated and installed.
Until then CI jobs won't get processed.


A new cert has been installed in the last few minutes and ZUUL has 
started kicking off new jobs so we should be through the backlog soon.


At this week's meeting we'll discuss putting something in place to ensure
we are ahead of this the next time.
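
Something as simple as a cron'd openssl check would catch it, e.g. (hostname
below is illustrative):

  # warn if the cert on the public endpoint expires within the next 30 days
  echo | openssl s_client -connect ci-overcloud.example.com:443 2>/dev/null \
    | openssl x509 -noout -checkend $((30*24*3600)) \
    || echo "WARNING: SSL cert expires within 30 days"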


Derek



Dan


__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev



__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


[openstack-dev] TripleO: CI down... SSL cert expired

2015-04-11 Thread Dan Prince
Looks like our SSL certificate has expired for the currently active CI
cloud. We are working on getting a new one generated and installed.
Until then CI jobs won't get processed.

Dan


__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [TripleO] CI outage

2015-03-30 Thread Derek Higgins


Tl;dr tripleo ci is back up and running, see below for more

On 21/03/15 01:41, Dan Prince wrote:

Short version:

The RH1 CI region has been down since yesterday afternoon.

We have a misbehaving switch and have filed a support ticket with the
vendor to troubleshoot things further. We hope to know more this
weekend, or Monday at the latest.

Long version:

Yesterday afternoon we started seeing issues in scheduling jobs on the
RH1 CI cloud. We haven't made any OpenStack configuration changes
recently, and things have been quite stable for some time now (our
uptime was 365 days on the controller).

Initially we found a misconfigured Keystone URL which was preventing
some diagnostic queries via OS clients external to the rack. This
setting hadn't been recently changed however and didn't seem to bother
nodepool before so I don't think it is the cause of the outage...

MySQL also got a bounce. It seemed happy enough after a restart as well.

After fixing the keystone setting and bouncing MySQL instances appears
to go ACTIVE but we were still having connectivity issues getting
floating IPs and DHCP working on overcloud instances. After a good bit
of debugging we started looking at the switches. Turns out one of them
has high CPU usage (above the warning threshold) and MAC addresses are
also unstable (ports are moving around).

Until this is resolved RH1 is unavailable to host CI jobs. Will
post back here with an update once we have more information.


RH1 has been running as expected since last Thursday afternoon, which 
means the cloud was down for almost a week. I'm still not entirely sure 
what some of the problems were; at various times during the week we tried a 
number of different interventions which may have caused (or exposed) 
some of our problems, e.g.


at one stage we restarted openvswitch in an attempt to ensure nothing 
had gone wrong with our ovs tunnels, around the same time (and possibly 
caused by the restart), we started getting progressively worse 
connections to some of our servers. With lots of entries like this on 
our bastion server
Mar 20 13:22:49 host01-rack01 kernel: bond0.5: received packet with own 
address as source address
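
A quick way to see how often that is hitting is just counting the entries
(log path assumed):

  grep -c "received packet with own address as source address" /var/log/messages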


Not linking the restart with the looping packets message and instead 
thinking we may have a problem with the switch we put in a call with our 
switch vendor.


Continuing to chase down a problem on our own servers we noticed that 
tcpdump was reporting at times about 100,000 ARP packets per second 
(sometimes more).
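
For reference, that sort of rate is easy enough to measure with plain tcpdump
(a sketch, using the same trunk interface as in the log line above):

  # count ARP packets seen in a 10 second window on the trunk interface
  timeout 10 tcpdump -i bond0.5 -nn arp 2>/dev/null | wc -l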


Various interventions stopped the excess broadcast traffic e.g.
  Shutting down most of the compute nodes stopped the excess traffic, 
but the problem wasn't linked to any one particular compute node
  Running the tripleo os-refresh-config script on each compute node 
stopped the excess traffic


But restarting the controller node caused the excess traffic to return

Eventually we got the cloud running without the flood of broadcast 
traffic, with a small number of compute nodes, but instances still 
weren't getting IP addresses; with nova and neutron in debug mode we saw 
an error where nova was failing to mount the qcow image (iirc it was 
attempting to resize the image).


Unable to figure out why this was working in the past but now isn't, we 
redeployed this single compute node using the original image that was 
used (over a year ago). Instances on this compute node were booting but 
failing to get an IP address, we noticed this was because of a 
difference between the time on the controller when compared to the 
compute node. After resetting the time, now instances were booting and 
networking was working as expected (this was now Wednesday evening).
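
The skew itself is trivial to spot once you think to look, e.g. by comparing
epoch seconds across the nodes (hostnames here are made up):

  for h in controller0 compute0; do echo -n "$h: "; ssh $h date +%s; done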


Looking back at the error while mounting the qcow image, I believe this 
was a red herring, it looks like this problem was always present on our 
system but we didn't have scary looking tracebacks in the logs until we 
switched to debug mode.


Now pretty confident we could get back to a running system by starting up 
all the compute nodes again, ensuring the os-refresh-config scripts 
were run, and ensuring the times were all set correctly on each host, we 
decided to remove any entropy that may have built up while debugging 
problems on each compute node, so we redeployed all of our compute nodes 
from scratch. This all went as expected but was a little time consuming 
as we spent time to verify each step as we went along, the steps went 
something like this


o with the exception of the overcloud controller, nova delete all of 
the hosts on the undercloud (31 hosts)


o we now have a problem, in tripleo the controller and compute nodes are 
tied together in a single heat template, so we need the heat template 
that was used a year ago to deploy the whole overcloud along with the 
parameters that were passed into it, we had actually done this before 
when adding new compute nodes to the cloud so it wasn't new territory.
   o Use heat template-show ci-overcloud to get the original heat 
template (a 

[openstack-dev] [TripleO] CI outage

2015-03-20 Thread Dan Prince
Short version:

The RH1 CI region has been down since yesterday afternoon.

We have a misbehaving switch and have filed a support ticket with the
vendor to troubleshoot things further. We hope to know more this
weekend, or Monday at the latest.

Long version:

Yesterday afternoon we started seeing issues in scheduling jobs on the
RH1 CI cloud. We haven't made any OpenStack configuration changes
recently, and things have been quite stable for some time now (our
uptime was 365 days on the controller).

Initially we found a misconfigured Keystone URL which was preventing
some diagnostic queries via OS clients external to the rack. This
setting hadn't been recently changed however and didn't seem to bother
nodepool before so I don't think it is the cause of the outage...

MySQL also got a bounce. It seemed happy enough after a restart as well.

After fixing the keystone setting and bouncing MySQL instances appears
to go ACTIVE but we were still having connectivity issues getting
floating IPs and DHCP working on overcloud instances. After a good bit
of debugging we started looking at the switches. Turns out one of them
has high CPU usage (above the warning threshold) and MAC addresses are
also unstable (ports are moving around).

Until this is resolved RH1 is unavailable to host CI jobs. Will
post back here with an update once we have more information.

Dan


__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


[openstack-dev] [TripleO] CI/CD report - 2014-12-20 - 2014-12-27

2014-12-27 Thread James Polley
It's been a bad week for CI, mostly due to setuptools.

Cores, please review https://review.openstack.org/#/c/144184/ immediately,
as CI is currently broken.

2014-12-19 - Neutron committed a change which had a symlink. This broke
pip install neutron, which broke CI for around 6 hours.

2014-12-22 - the release of pip 6.0 immediately triggered two distinct
issues. One was fixed by 6.0.2 being released, the other required a patch
to version specifiers in ceilometer. In total, CI was broken for around 24
hours.

2014-12-24 - new nodepool images were built which contained pip 6.0. This
triggered another issue related to the fact that pip now creates ~/.cache
if it doesn't already exist. At the time of writing, CI has been broken for
~3.5 days. https://review.openstack.org/#/c/144184/ seems to fix the
problem, but it needs to get review from cores before it can land.

Not listed here: setuptools 8.4 was released, then pulled after it was
found to have problems installing/upgrading many packages. Because our CI
was already broken, this had no noticeable effect.

As always, most of this information is pulled from DerekH's notes on
https://etherpad.openstack.org/p/tripleo-ci-breakages and more details can
be found there.
___
OpenStack-dev mailing list
OpenStack-dev@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


[openstack-dev] [TripleO] CI/CD report - 2014-12-12 - 2014-12-19

2014-12-19 Thread James Polley
Two major CI outages this week

2014-12-12 - 2014-12-15 - pip install MySQL-python failing on fedora
- There was an updated mariadb-devel package, which caused pip install of
the python bindings to fail as gcc could not build using the provided
headers.
 - derekh put in a workaround on the 15th but we have to wait until
upstream provides a fixed package for a permanent resolution

2014-12-17 - failures in many projects on py33 tests
- Caused by an unexpected interaction between new features in pbr and the
way docutils handles python3 compatibility
- derekh resolved this by tweaking the build process to not build pbr -
just download the latest pbr from upstream

As always, more details can be found at
https://etherpad.openstack.org/p/tripleo-ci-breakages

The HP2 region is still struggling along trying to be built. I've created a
trello board at https://trello.com/b/MXbIP2qe/tripleo-cd to track current
roadblocks + the current outstanding patches we're using to build HP2.

If you're a CD admin and would like to help get HP2 up and running, take a
look at the board (and ping me when you hit something I've written in a way
that only makes sense if you already understand the problem). If you're not
a CD admin, a few of the patches need some simple tidyups.
___
OpenStack-dev mailing list
OpenStack-dev@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [TripleO] CI/CD report - 2014-12-12 - 2014-12-19

2014-12-19 Thread Gregory Haynes
Excerpts from James Polley's message of 2014-12-19 17:10:41 +:
 Two major CI outages this week
 
 2014-12-12 - 2014-12-15 - pip install MySQL-python failing on fedora
 - There was an updated mariadb-devel package, which caused pip install of
 the python bindings to fail as gcc could not build using the provided
 headers.
  - derekh put in a workaround on the 15th but we have to wait until
 upstream provides a fixed package for a permanent resolution
 
 2014-12-17 - failures in many projects on py33 tests
 - Caused by an unexpected interaction between new features in pbr and the
 way docutils handles python3 compatibility
 - derekh resolved this by tweaking the build process to not build pbr -
 just download the latest pbr from upstream

I am a bad person and forgot to update our CI outage etherpad, but we
had another outage that was caused by the setuptools PEP440 breakage:

https://review.openstack.org/#/c/141659/

We might be able to revert this now if the world is fixed

___
OpenStack-dev mailing list
OpenStack-dev@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [TripleO] CI report: 2014-12-5 - 2014-12-11

2014-12-13 Thread James Polley
Resending with correct subject tag. Never send email before coffee.

On Fri, Dec 12, 2014 at 9:33 AM, James Polley j...@jamezpolley.com wrote:

 In the week since the last email we've had no major CI failures. This
 makes it very easy for me to write my first CI report.

 There was a brief period where all the Ubuntu tests failed while an update
 was rolling out to various mirrors. DerekH worked around this quickly by
 dropping in a DNS hack, which remains in place. A long term fix for this
 problem probably involves setting up our own apt mirrors.

 check-tripleo-ironic-overcloud-precise-ha remains flaky, and hence
 non-voting.

 As always more details can be found here (although this week there's
 nothing to see)
 https://etherpad.openstack.org/p/tripleo-ci-breakages

___
OpenStack-dev mailing list
OpenStack-dev@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [TripleO] CI report : 1/11/2014 - 4/12/2014

2014-12-08 Thread Derek Higgins
On 04/12/14 13:37, Dan Prince wrote:
 On Thu, 2014-12-04 at 11:51 +, Derek Higgins wrote:
 A month since my last update, sorry my bad

 since the last email we've had 5 incidents causing ci failures

 26/11/2014 : Lots of ubuntu jobs failed over 24 hours (maybe half)
 - We seem to suffer any time an ubuntu mirror isn't in sync causing hash
 mismatch errors. For now I've pinned DNS on our proxy to a specific
 server so we stop DNS round robining
 
 This sounds fine to me. I personally like the model where you pin to a
 specific mirror, perhaps one that is geographically closer to your
 datacenter. This also makes Squid caching (in the rack) happier in some
 cases.
 

 21/11/2014 : All tripleo jobs failed for about 16 hours
 - Neutron started asserting that local_ip be set to a valid ip address,
 on the seed we had been leaving it blank
 - Cinder moved to using oslo.concurrency which in turn requires that
 lock_path be set, we are now setting it
 
 
 Thinking about how we might catch these ahead of time with our limited
 resources ATM. These sorts of failures all seem related to configuration
 and/or requirements changes. I wonder if we were to selectively
 (automatically) run check experimental jobs on all reviews with
 associated tickets which have either doc changes or modify
 requirements.txt. Probably a bit of work to pull this off but if we had
 a report containing these results coming down the pike we might be
 able to catch them ahead of time.
Yup, this sounds like it could be beneficial, alternatively if we soon
have the capacity to run on more projects (capacity is increasing) we'll
be running on all reviews and we'll be able to generate the report you're
talking about, either way we should do something like this soon.

 
 

 8/11/2014 : All fedora tripleo jobs failed for about 60 hours (over a
 weekend)
 - A url being accessed on  https://bzr.linuxfoundation.org is no longer
 available, we removed the dependency

 7/11/2014 : All tripleo tests failed for about 24 hours
 - Options were removed from nova.conf that had been deprecated (although
 no deprecation warnings were being reported), we were still using these
 in tripleo

 as always more details can be found here
 https://etherpad.openstack.org/p/tripleo-ci-breakages
 
 Thanks for sending this out! Very useful.
no problem
 
 Dan
 

 thanks,
 Derek.

 ___
 OpenStack-dev mailing list
 OpenStack-dev@lists.openstack.org
 http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
 
 
 
 ___
 OpenStack-dev mailing list
 OpenStack-dev@lists.openstack.org
 http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
 


___
OpenStack-dev mailing list
OpenStack-dev@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [TripleO] CI report : 1/11/2014 - 4/12/2014

2014-12-04 Thread Dan Prince
On Thu, 2014-12-04 at 11:51 +, Derek Higgins wrote:
 A month since my last update, sorry my bad
 
 since the last email we've had 5 incidents causing ci failures
 
 26/11/2014 : Lots of ubuntu jobs failed over 24 hours (maybe half)
 - We seem to suffer any time an ubuntu mirror isn't in sync causing hash
 mismatch errors. For now I've pinned DNS on our proxy to a specific
 server so we stop DNS round robining

This sounds fine to me. I personally like the model where you pin to a
specific mirror, perhaps one that is geographically closer to your
datacenter. This also makes Squid caching (in the rack) happier in some
cases.
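
The pin itself can be as simple as an /etc/hosts entry on the proxy host
(hostname real, IP below is a placeholder for whichever mirror gets picked):

  echo "203.0.113.10  archive.ubuntu.com" | sudo tee -a /etc/hosts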

 
 21/11/2014 : All tripleo jobs failed for about 16 hours
 - Neutron started asserting that local_ip be set to a valid ip address,
 on the seed we had been leaving it blank
 - Cinder moved to using oslo.concurrency which in turn requires that
 lock_path be set, we are now setting it


Thinking about how we might catch these ahead of time with our limited
resources ATM. These sorts of failures all seem related to configuration
and/or requirements changes. I wonder if we were to selectively
(automatically) run check experimental jobs on all reviews with
associated tickets which have either doc changes or modify
requirements.txt. Probably a bit of work to pull this off but if we had
a report containing these results coming down the pike we might be
able to catch them ahead of time.


 
 8/11/2014 : All fedora tripleo jobs failed for about 60 hours (over a
 weekend)
 - A url being accessed on  https://bzr.linuxfoundation.org is no longer
 available, we removed the dependency
 
 7/11/2014 : All tripleo tests failed for about 24 hours
 - Options were removed from nova.conf that had been deprecated (although
 no deprecation warnings were being reported), we were still using these
 in tripleo
 
 as always more details can be found here
 https://etherpad.openstack.org/p/tripleo-ci-breakages

Thanks for sending this out! Very useful.

Dan

 
 thanks,
 Derek.
 
 ___
 OpenStack-dev mailing list
 OpenStack-dev@lists.openstack.org
 http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev



___
OpenStack-dev mailing list
OpenStack-dev@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [TripleO] [CI] Cinder/Ceph CI setup

2014-12-02 Thread Giulio Fidente

On 11/27/2014 02:23 PM, Derek Higgins wrote:

On 27/11/14 10:21, Duncan Thomas wrote:

I'd suggest starting by making it an extra job, so that it can be
monitored for a while for stability without affecting what is there.


we have to be careful here; adding an extra job for this is probably the
safest option, but tripleo CI resources are a constraint. For that reason
I would add it to the HA job (which is currently non-voting) and once
it's stable we should make it voting.



I'd be supportive of making it the default HA job in the longer term as
long as the LVM code is still getting tested somewhere - LVM is still
the reference implementation in cinder and after discussion there was
strong resistance to changing that.



We are and would continue to use lvm for our non-ha jobs. If I
understand it correctly the tripleo lvm support isn't HA, so continuing
to test it on our HA job doesn't achieve much.



I've no strong opinions on the node layout, I'll leave that to more
knowledgable people to discuss.

Is the ceph/tripleO code in a working state yet? Is there a guide to
using it?


hi guys, thanks for replying

I just wanted to add here a link to the blueprint so you can keep track 
of development [1]


all the code to make it happen (except the actual CI job config changes) 
is up for review now so feedback and reviews are indeed appreciated :)


1. https://blueprints.launchpad.net/tripleo/+spec/tripleo-kilo-cinder-ha
--
Giulio Fidente
GPG KEY: 08D733BA

___
OpenStack-dev mailing list
OpenStack-dev@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [TripleO] [CI] Cinder/Ceph CI setup

2014-12-02 Thread Ben Nemec
On 11/27/2014 07:23 AM, Derek Higgins wrote:
 On 27/11/14 10:21, Duncan Thomas wrote:
 I'd suggest starting by making it an extra job, so that it can be
 monitored for a while for stability without affecting what is there.
 
 we have to be careful here; adding an extra job for this is probably the
 safest option, but tripleo CI resources are a constraint. For that reason
 I would add it to the HA job (which is currently non-voting) and once
 it's stable we should make it voting.

The only problem is that the HA job has been non-voting for so long that
I don't think anyone pays attention to it.  That said, I don't have a
better suggestion because it makes no sense to run a Cinder HA job in a
non-HA CI run, so I guess until HA CI is fixed we're kind of stuck.

So +1 to making this the default in HA jobs.

 

 I'd be supportive of making it the default HA job in the longer term as
 long as the LVM code is still getting tested somewhere - LVM is still
 the reference implementation in cinder and after discussion there was
 strong resistance to changing that.
 We are and would continue to use lvm for our non-ha jobs. If I
 understand it correctly the tripleo lvm support isn't HA, so continuing
 to test it on our HA job doesn't achieve much.
 

 I've no strong opinions on the node layout, I'll leave that to more
 knowledgable people to discuss.

 Is the ceph/tripleO code in a working state yet? Is there a guide to
 using it?


 On 26 November 2014 at 13:10, Giulio Fidente gfide...@redhat.com wrote:

 hi there,

 while working on the TripleO cinder-ha spec meant to provide HA for
 Cinder via Ceph [1], we wondered how to (if at all) test this in CI,
 so we're looking for some feedback

 first of all, shall we make Cinder/Ceph the default for our
 (currently non-voting) HA job?
 (check-tripleo-ironic-overcloud-precise-ha)

 current implementation (under review) should permit the
 deployment of both the Ceph monitors and Ceph OSDs on either
 controllers, dedicated nodes, or to split them up so that only OSDs
 are on dedicated nodes

 what would be the best scenario for CI?

 * a single additional node hosting a Ceph OSD with the Ceph monitors
 deployed on all controllers (my preference is for this one)
 
 I would be happy with this so long as it didn't drastically increase the
 time to run the HA job.
 

 * a single additional node hosting a Ceph OSD and a Ceph monitor

 * no additional nodes with controllers also serving as Ceph monitor
 and Ceph OSD

 more scenarios? comments? Thanks for helping

 1.
 https://blueprints.launchpad.net/tripleo/+spec/tripleo-kilo-cinder-ha
 -- 
 Giulio Fidente
 GPG KEY: 08D733BA

 ___
 OpenStack-dev mailing list
 OpenStack-dev@lists.openstack.org
 http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev




 -- 
 Duncan Thomas


 ___
 OpenStack-dev mailing list
 OpenStack-dev@lists.openstack.org
 http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev

 
 
 ___
 OpenStack-dev mailing list
 OpenStack-dev@lists.openstack.org
 http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
 


___
OpenStack-dev mailing list
OpenStack-dev@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [TripleO] [CI] Cinder/Ceph CI setup

2014-11-27 Thread Duncan Thomas
I'd suggest starting by making it an extra job, so that it can be monitored
for a while for stability without affecting what is there.

I'd be supportive of making it the default HA job in the longer term as
long as the LVM code is still getting tested somewhere - LVM is still the
reference implementation in cinder and after discussion there was strong
resistance to changing that.

I've no strong opinions on the node layout, I'll leave that to more
knowledgable people to discuss.

Is the ceph/tripleO code in a working state yet? Is there a guide to using
it?


On 26 November 2014 at 13:10, Giulio Fidente gfide...@redhat.com wrote:

 hi there,

 while working on the TripleO cinder-ha spec meant to provide HA for Cinder
 via Ceph [1], we wondered how to (if at all) test this in CI, so we're
 looking for some feedback

 first of all, shall we make Cinder/Ceph the default for our (currently
 non-voting) HA job? (check-tripleo-ironic-overcloud-precise-ha)

 current implementation (under review) should permit the deployment of
 both the Ceph monitors and Ceph OSDs on either controllers, dedicated
 nodes, or to split them up so that only OSDs are on dedicated nodes

 what would be the best scenario for CI?

 * a single additional node hosting a Ceph OSD with the Ceph monitors
 deployed on all controllers (my preference is for this one)

 * a single additional node hosting a Ceph OSD and a Ceph monitor

 * no additional nodes with controllers also serving as Ceph monitor and
 Ceph OSD

 more scenarios? comments? Thanks for helping

 1. https://blueprints.launchpad.net/tripleo/+spec/tripleo-kilo-cinder-ha
 --
 Giulio Fidente
 GPG KEY: 08D733BA

 ___
 OpenStack-dev mailing list
 OpenStack-dev@lists.openstack.org
 http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev




-- 
Duncan Thomas
___
OpenStack-dev mailing list
OpenStack-dev@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [TripleO] [CI] Cinder/Ceph CI setup

2014-11-27 Thread Derek Higgins
On 27/11/14 10:21, Duncan Thomas wrote:
 I'd suggest starting by making it an extra job, so that it can be
 monitored for a while for stability without affecting what is there.

we have to be careful here; adding an extra job for this is probably the
safest option, but tripleo CI resources are a constraint. For that reason
I would add it to the HA job (which is currently non-voting) and once
it's stable we should make it voting.

 
 I'd be supportive of making it the default HA job in the longer term as
 long as the LVM code is still getting tested somewhere - LVM is still
 the reference implementation in cinder and after discussion there was
 strong resistance to changing that.
We are and would continue to use lvm for our non-ha jobs. If I
understand it correctly the tripleo lvm support isn't HA, so continuing
to test it on our HA job doesn't achieve much.

 
 I've no strong opinions on the node layout, I'll leave that to more
 knowledgable people to discuss.
 
 Is the ceph/tripleO code in a working state yet? Is there a guide to
 using it?
 
 
 On 26 November 2014 at 13:10, Giulio Fidente gfide...@redhat.com wrote:
 
 hi there,
 
 while working on the TripleO cinder-ha spec meant to provide HA for
 Cinder via Ceph [1], we wondered how to (if at all) test this in CI,
 so we're looking for some feedback
 
 first of all, shall we make Cinder/Ceph the default for our
 (currently non-voting) HA job?
 (check-tripleo-ironic-overcloud-precise-ha)
 
 current implementation (under review) should permit the
 deployment of both the Ceph monitors and Ceph OSDs on either
 controllers, dedicated nodes, or to split them up so that only OSDs
 are on dedicated nodes
 
 what would be the best scenario for CI?
 
 * a single additional node hosting a Ceph OSD with the Ceph monitors
 deployed on all controllers (my preference is for this one)

I would be happy with this so long as it didn't drastically increase the
time to run the HA job.

 
 * a single additional node hosting a Ceph OSD and a Ceph monitor
 
 * no additional nodes with controllers also serving as Ceph monitor
 and Ceph OSD
 
 more scenarios? comments? Thanks for helping
 
 1.
 https://blueprints.launchpad.net/tripleo/+spec/tripleo-kilo-cinder-ha
 -- 
 Giulio Fidente
 GPG KEY: 08D733BA
 
 ___
 OpenStack-dev mailing list
 OpenStack-dev@lists.openstack.org
 http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
 
 
 
 
 -- 
 Duncan Thomas
 
 
 ___
 OpenStack-dev mailing list
 OpenStack-dev@lists.openstack.org
 http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
 


___
OpenStack-dev mailing list
OpenStack-dev@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


[openstack-dev] [TripleO] [CI] Cinder/Ceph CI setup

2014-11-26 Thread Giulio Fidente

hi there,

while working on the TripleO cinder-ha spec meant to provide HA for 
Cinder via Ceph [1], we wondered how to (if at all) test this in CI, so 
we're looking for some feedback


first of all, shall we make Cinder/Ceph the default for our (currently 
non-voting) HA job? (check-tripleo-ironic-overcloud-precise-ha)


current implementation (under review) should permit the deployment 
of both the Ceph monitors and Ceph OSDs on either controllers, dedicated 
nodes, or to split them up so that only OSDs are on dedicated nodes


what would be the best scenario for CI?

* a single additional node hosting a Ceph OSD with the Ceph monitors 
deployed on all controllers (my preference is for this one)


* a single additional node hosting a Ceph OSD and a Ceph monitor

* no additional nodes with controllers also serving as Ceph monitor and 
Ceph OSD


more scenarios? comments? Thanks for helping

1. https://blueprints.launchpad.net/tripleo/+spec/tripleo-kilo-cinder-ha
--
Giulio Fidente
GPG KEY: 08D733BA

___
OpenStack-dev mailing list
OpenStack-dev@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


[openstack-dev] [TripleO] CI report : 18/10/2014 - 31/10/2014

2014-11-06 Thread Derek Higgins
Hi All,

The week before last saw no problems with CI

But last week we had 3 separate problems causing tripleo CI tests to
fail until they were dealt with

1. pypi.openstack.org, which we were using in tripleo-ci, is no longer
being maintained; we've now moved to pypi.python.org
2. nova started using oslo.concurrency and in the process removed
nova/openstack/common/lockutils.py, which was being imported by ironic
3. keystone removed a deprecated class which we had been using

See more details on the etherpad
https://etherpad.openstack.org/p/tripleo-ci-breakages

thanks,
Derek.

___
OpenStack-dev mailing list
OpenStack-dev@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


[openstack-dev] [TripleO] CI report : 04/10/2014 - 17/10/2014

2014-10-17 Thread Derek Higgins
Hi All,

   Nothing to report since the last report, 2 weeks of no breakages.

thanks,
Derek.

___
OpenStack-dev mailing list
OpenStack-dev@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


[openstack-dev] [TripleO] CI report : 27/09/2014 - 03/10/2014

2014-10-06 Thread Derek Higgins
Hi All,
There was 1 CI event last week: a
regression in ironic (https://bugs.launchpad.net/tripleo/+bug/1375641).
All ironic tripleo CI tests failed for about 12 hours.

For more info see https://etherpad.openstack.org/p/tripleo-ci-breakages

thanks,
Derek.

___
OpenStack-dev mailing list
OpenStack-dev@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


[openstack-dev] [TripleO] CI report : 24/09/2014 - 26/09/2014

2014-09-26 Thread Derek Higgins
Hi All,
On Wednesday, I started keeping a short summary of issues hit by
tripleo CI, so in time we can look back to properly assess the frequency
of problems along with their causes.

The list will be maintained here (most recent at the top)
https://etherpad.openstack.org/p/tripleo-ci-breakages

I'll also mail the list every week with a summary of any issues we've
hit since the last mail to give people an idea what kind of things
effect us (we usually hit at least one if not more regressions a week,
which can take our jobs out of action for anywhere between a few hours
to days).

We had 2 events to report this week, more details on etherpad[1]

26/9/2014 - All jobs failed that were running around 1AM UTC, there was
talk on #infra about a zuul reload at the time so I've put this down to
the restart and haven't looked any further.

24/9/2014 - Regression in horizon, all tripleo CI jobs failed for about
6 hours.

[1] - https://etherpad.openstack.org/p/tripleo-ci-breakages

___
OpenStack-dev mailing list
OpenStack-dev@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [TripleO] CI status update for this week

2014-06-23 Thread Charles Crouch


- Original Message -
 
 
 Well we probably need some backwards compat glue to keep deploying supported
 versions. More on that in the spec I'm drafting.

A spec around deploying multiple versions of the overcloud? If so, great :-)

Re: https://bugs.launchpad.net/tripleo/+bug/1330735 and deploying older 
versions of the overcloud. Trying an icehouse overcloud using the latest 
(fixed) 
undercloud should just work right? Assuming the stable/icehouse branches of 
tripleo-image-elements were used to build the overcloud images and same 
for tripleo-heat-templates to deploy it?

 On 21 Jun 2014 12:26, Dan Prince  dpri...@redhat.com  wrote:
 
 
 On Fri, 2014-06-20 at 16:51 -0400, Charles Crouch wrote:
  
  - Original Message -
   Not a great week for TripleO CI. We had 3 different failures related to:
   
   Nova [1]: we were using a deprecated config option
   Heat [2]: missing heat data obtained from the Heat CFN API
   Neutron [3]: a broken GRE overlay network setup
  
  The last two are bugs, but is there anything tripleo can do about avoiding
  the first one in the future?:
 
 Yes. Reviewing and monitoring our log files would have been helpful
 here. Nova did nothing wrong... we were just plain using an old option
 which was deprecated in Icehouse.
 
 With TripleO's upstream focus we need to maintain a balancing act and
 try to avoid using new option names until a release has been made. I
 think once the release is made however (Icehouse in this case) we should
 immediately move to drop all deprecated options and use the new
  versions. If we follow a process like this we should be safeguarded
 from this sort of failure in the future.
 
 Dan
 
  e.g. reviewing a list of deprecated options and seeing when they will be
  removed.
  
  do the integrated projects have a protocol for when an option is deprecated
  and at what point it can be removed?
  e.g. if I make something deprecated in icehouse I can remove it in juno,
  but if I
  make something deprecated at the start of juno I can't remove it at the end
  of juno?
  
   
   The TripleO check jobs look to be running stable again today so if your
   patch had failures from earlier this week then recheck away (perhaps
   referencing one of these bugs if appropriate). The queue is fairly empty
   right now...
   
   Thanks for all the help in tracking these down and getting things fixed.
   
   [1] https://bugs.launchpad.net/tripleo/+bug/1292105
  
  I think [1] was meant to be
  https://bugs.launchpad.net/tripleo/+bug/1330735
  
   [2] https://bugs.launchpad.net/heat/+bug/1331720
   [3] https://bugs.launchpad.net/tripleo/+bug/1292105
   
   
   
   ___
   OpenStack-dev mailing list
   OpenStack-dev@lists.openstack.org
   http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
   
  
  ___
  OpenStack-dev mailing list
  OpenStack-dev@lists.openstack.org
  http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
 
 
 
 ___
 OpenStack-dev mailing list
 OpenStack-dev@lists.openstack.org
 http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
 
 ___
 OpenStack-dev mailing list
 OpenStack-dev@lists.openstack.org
 http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
 

___
OpenStack-dev mailing list
OpenStack-dev@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


[openstack-dev] [TripleO] CI status update for this week

2014-06-20 Thread Dan Prince
Not a great week for TripleO CI. We had 3 different failures related to:

 Nova [1]: we were using a deprecated config option
 Heat [2]: missing heat data obtained from the Heat CFN API
 Neutron [3]: a broken GRE overlay network setup

The TripleO check jobs look to be running stable again today so if your
patch had failures from earlier this week then recheck away (perhaps
referencing one of these bugs if appropriate). The queue is fairly empty
right now...

Thanks for all the help in tracking these down and getting things fixed.

[1] https://bugs.launchpad.net/tripleo/+bug/1292105
[2] https://bugs.launchpad.net/heat/+bug/1331720
[3] https://bugs.launchpad.net/tripleo/+bug/1292105



___
OpenStack-dev mailing list
OpenStack-dev@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [TripleO] CI status update for this week

2014-06-20 Thread Charles Crouch


- Original Message -
 Not a great week for TripleO CI. We had 3 different failures related to:
 
  Nova [1]: we were using a deprecated config option
  Heat [2]: missing heat data obtained from the Heat CFN API
  Neutron [3]: a broken GRE overlay network setup

The last two are bugs, but is there anything tripleo can do about avoiding the 
first one in the future?:
e.g. reviewing a list of deprecated options and seeing when they will be 
removed.

do the integrated projects have a protocol for when an option is deprecated and 
at what point it can be removed?
e.g. if I make something deprecated in icehouse I can remove it in juno, but if 
I
make something deprecated at the start of juno I can't remove it at the end of 
juno?

 
 The TripleO check jobs look to be running stable again today so if your
 patch had failures from earlier this week then recheck away (perhaps
 referencing one of these bugs if appropriate). The queue is fairly empty
 right now...
 
 Thanks for all the help in tracking these down and getting things fixed.
 
 [1] https://bugs.launchpad.net/tripleo/+bug/1292105

I think [1] was meant to be
https://bugs.launchpad.net/tripleo/+bug/1330735

 [2] https://bugs.launchpad.net/heat/+bug/1331720
 [3] https://bugs.launchpad.net/tripleo/+bug/1292105
 
 
 
 ___
 OpenStack-dev mailing list
 OpenStack-dev@lists.openstack.org
 http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
 

___
OpenStack-dev mailing list
OpenStack-dev@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [TripleO] CI status update for this week

2014-06-20 Thread Joe Gordon
On Jun 20, 2014 1:52 PM, Charles Crouch ccro...@redhat.com wrote:



 - Original Message -
  Not a great week for TripleO CI. We had 3 different failures related to:
 
   Nova [1]: we were using a deprecated config option
   Heat [2]: missing heat data obtained from the Heat CFN API
   Neutron [3]: a broken GRE overlay network setup

 The last two are bugs, but is there anything tripleo can do about
avoiding the first one in the future?:
 e.g. reviewing a list of deprecated options and seeing when they will be
removed.
++

 do the integrated projects have a protocol for when an option is
deprecated and at what point it can be removed?
 e.g. if I make something deprecated in icehouse I can remove it in juno,
but if I
 make something deprecated at the start of juno I can't remove it at the
end of juno?

That is exactly what we do: deprecate for one release. This was the patch
that removed the deprecated icehouse options.

 
  The TripleO check jobs look to be running stable again today so if your
  patch had failures from earlier this week then recheck away (perhaps
  referencing one of these bugs if appropriate). The queue is fairly empty
  right now...
 
  Thanks for all the help in tracking these down and getting things fixed.
 
  [1] https://bugs.launchpad.net/tripleo/+bug/1292105

 I think [1] was meant to be
 https://bugs.launchpad.net/tripleo/+bug/1330735

  [2] https://bugs.launchpad.net/heat/+bug/1331720
  [3] https://bugs.launchpad.net/tripleo/+bug/1292105
 
 
 
  ___
  OpenStack-dev mailing list
  OpenStack-dev@lists.openstack.org
  http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
 

 ___
 OpenStack-dev mailing list
 OpenStack-dev@lists.openstack.org
 http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
___
OpenStack-dev mailing list
OpenStack-dev@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [TripleO] CI status update for this week

2014-06-20 Thread Charles Crouch


- Original Message -
 
 
 
 On Jun 20, 2014 1:52 PM, Charles Crouch  ccro...@redhat.com  wrote:
  
  
  
  - Original Message -
   Not a great week for TripleO CI. We had 3 different failures related to:
   
   Nova [1]: we were using a deprecated config option
   Heat [2]: missing heat data obtained from the Heat CFN API
   Neutron [3]: a broken GRE overlay network setup
  
  The last two are bugs, but is there anything tripleo can do about avoiding
  the first one in the future?:
  e.g. reviewing a list of deprecated options and seeing when they will be
  removed.
 ++
  
  do the integrated projects have a protocol for when an option is deprecated
  and at what point it can be removed?
  e.g. if I make something deprecated in icehouse I can remove it in juno,
  but if I
  make something deprecated at the start of juno I can't remove it at the end
  of juno?
 
 That is exactly what we do, deprecate for one release. This was the removal
 of deprecated icehouse options patch.

Ok great, and just to be clear I didn't mean to imply Nova did anything wrong 
here.
I'm looking for what tripleo can do to make sure it keeps up.

Is there any easy way to see what options have been deprecated in a release for
the integrated projects?
I guess a list would only need to be pulled together once at the end of each 
release.


  
   
   The TripleO check jobs look to be running stable again today so if your
   patch had failures from earlier this week then recheck away (perhaps
   referencing one of these bugs if appropriate). The queue is fairly empty
   right now...
   
   Thanks for all the help in tracking these down and getting things fixed.
   
   [1] https://bugs.launchpad.net/tripleo/+bug/1292105
  
  I think [1] was meant to be
  https://bugs.launchpad.net/tripleo/+bug/1330735
  
   [2] https://bugs.launchpad.net/heat/+bug/1331720
   [3] https://bugs.launchpad.net/tripleo/+bug/1292105
   
   
   
   ___
   OpenStack-dev mailing list
   OpenStack-dev@lists.openstack.org
   http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
   
  
  ___
  OpenStack-dev mailing list
  OpenStack-dev@lists.openstack.org
  http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
 
 ___
 OpenStack-dev mailing list
 OpenStack-dev@lists.openstack.org
 http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
 

___
OpenStack-dev mailing list
OpenStack-dev@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [TripleO] CI status update for this week

2014-06-20 Thread Joe Gordon
On Fri, Jun 20, 2014 at 2:15 PM, Charles Crouch ccro...@redhat.com wrote:



 - Original Message -
 
 
 
  On Jun 20, 2014 1:52 PM, Charles Crouch  ccro...@redhat.com  wrote:
  
  
  
   - Original Message -
Not a great week for TripleO CI. We had 3 different failures related
 to:
   
Nova [1]: we were using a deprecated config option
Heat [2]: missing heat data obtained from the Heat CFN API
Neutron [3]: a broken GRE overlay network setup
  
   The last two are bugs, but is there anything tripleo can do about
 avoiding
   the first one in the future?:
   e.g. reviewing a list of deprecated options and seeing when they will
 be
   removed.
  ++
  
   do the integrated projects have a protocol for when an option is
 deprecated
   and at what point it can be removed?
   e.g. if I make something deprecated in icehouse I can remove it in
 juno,
   but if I
   make something deprecated at the start of juno I can't remove it at
 the end
   of juno?
 
  That is exactly what we do, deprecate for one release. This was the
 removal
  of deprecated icehouse options patch.

 Ok great, and just to be clear I didn't mean to imply Nova did anything
 wrong here.
 I'm looking for what tripleo can do to make sure it keeps up.

 Is there any easy way to see what options have been deprecated in a
 release for
 the integrated projects?
 I guess a list would only need to be pulled together once at the end of
 each release.


Yup, if you look at the sample config file you can generate, it will tell
you what has been deprecated.
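
For example, something along these lines could scrape that list out of a
generated sample config; treat it as a sketch, since it assumes the
'# Deprecated group/name - [group]/option' comment format the generator
emits:

# Rough sketch, not an official tool: list the deprecated option names
# recorded in a generated sample config (e.g. etc/nova/nova.conf.sample).
# Assumes the "# Deprecated group/name - [group]/option" comment format
# the oslo sample-config generator emits.
import re
import sys

DEPRECATED_RE = re.compile(r'^#\s*Deprecated group/name\s*-\s*(\S+)')


def deprecated_options(sample_path):
    found = []
    with open(sample_path) as sample:
        for line in sample:
            match = DEPRECATED_RE.match(line.strip())
            if match:
                found.append(match.group(1))
    return found


if __name__ == '__main__':
    for old_name in deprecated_options(sys.argv[1]):
        print(old_name)

Run against a project's generated sample config at the end of a release,
that gives roughly the list Charles is asking about.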




  
   
The TripleO check jobs look to be running stable again today so if
 your
patch had failures from earlier this week then recheck away (perhaps
referencing one of these bugs if appropriate). The queue is fairly
 empty
right now...
   
Thanks for all the help in tracking these down and getting things
 fixed.
   
[1] https://bugs.launchpad.net/tripleo/+bug/1292105
  
   I think [1] was meant to be
   https://bugs.launchpad.net/tripleo/+bug/1330735
  
[2] https://bugs.launchpad.net/heat/+bug/1331720
[3] https://bugs.launchpad.net/tripleo/+bug/1292105
   
   
   
___
OpenStack-dev mailing list
OpenStack-dev@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
   
  
   ___
   OpenStack-dev mailing list
   OpenStack-dev@lists.openstack.org
   http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
 
  ___
  OpenStack-dev mailing list
  OpenStack-dev@lists.openstack.org
  http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
 

 ___
 OpenStack-dev mailing list
 OpenStack-dev@lists.openstack.org
 http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev

___
OpenStack-dev mailing list
OpenStack-dev@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [TripleO] CI status update for this week

2014-06-20 Thread Clint Byrum
Excerpts from Charles Crouch's message of 2014-06-20 13:51:49 -0700:
 
 - Original Message -
  Not a great week for TripleO CI. We had 3 different failures related to:
  
   Nova [1]: we were using a deprecated config option
   Heat [2]: missing heat data obtained from the Heat CFN API
   Neutron [3]: a broken GRE overlay network setup
 
 The last two are bugs, but is there anything tripleo can do about avoiding 
 the first one in the future?:
 e.g. reviewing a list of deprecated options and seeing when they will be 
 removed.
 
 do the integrated projects have a protocol for when an option is deprecated 
 and at what point it can be removed?
 e.g. if I make something deprecated in icehouse I can remove it in juno, but 
 if I
 make something deprecated at the start of juno I can't remove it at the end 
 of juno?
 

Was this being logged as deprecated for a while? I think we probably
should aspire to fail CI if something starts printing out deprecation
warnings. We have a few more sprinkled here and there that I see in logs;
those are just ticking time bombs.
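
A minimal sketch of what that gate could look like (the log location and
the string matched are assumptions, not anything tripleo-ci does today):

# Minimal sketch of the "fail the job on deprecation warnings" idea; the
# log glob and the string matched are assumptions, tune to taste.
import glob
import sys

LOG_GLOB = '/var/log/upstart/*.log'   # wherever the openstack services log


def deprecation_hits(log_glob=LOG_GLOB):
    hits = []
    for path in glob.glob(log_glob):
        with open(path) as log:
            for lineno, line in enumerate(log, 1):
                if 'deprecated' in line.lower():
                    hits.append('%s:%d: %s' % (path, lineno, line.strip()))
    return hits


if __name__ == '__main__':
    hits = deprecation_hits()
    if hits:
        print('\n'.join(hits))
        print('%d deprecation warning(s) found, failing the run' % len(hits))
        sys.exit(1)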

___
OpenStack-dev mailing list
OpenStack-dev@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [TripleO] CI status update for this week

2014-06-20 Thread Dan Prince
On Fri, 2014-06-20 at 16:51 -0400, Charles Crouch wrote:
 
 - Original Message -
  Not a great week for TripleO CI. We had 3 different failures related to:
  
   Nova [1]: we were using a deprecated config option
   Heat [2]: missing heat data obtained from the Heat CFN API
   Neutron [3]: a broken GRE overlay network setup
 
 The last two are bugs, but is there anything tripleo can do about avoiding 
 the first one in the future?:

Yes. Reviewing and monitoring our log files would have been helpful
here. Nova did nothing wrong... we were just plain using an old option
which was deprecated in Icehouse.

With TripleO's upstream focus we need to maintain a balancing act and
try to avoid using new option names until a release has been made. I
think once the release is made however (Icehouse in this case) we should
immediately move to drop all deprecated options and use the new
versions. If we follow a process like this we should be safeguarded
from this sort of failure in the future.

Dan

 e.g. reviewing a list of deprecated options and seeing when they will be 
 removed.
 
 do the integrated projects have a protocol for when an option is deprecated 
 and at what point it can be removed?
 e.g. if I make something deprecated in icehouse I can remove it in juno, but 
 if I
 make something deprecated at the start of juno I can't remove it at the end 
 of juno?
 
  
  The TripleO check jobs look to be running stable again today so if your
  patch had failures from earlier this week then recheck away (perhaps
  referencing one of these bugs if appropriate). The queue is fairly empty
  right now...
  
  Thanks for all the help in tracking these down and getting things fixed.
  
  [1] https://bugs.launchpad.net/tripleo/+bug/1292105
 
 I think [1] was meant to be
 https://bugs.launchpad.net/tripleo/+bug/1330735
 
  [2] https://bugs.launchpad.net/heat/+bug/1331720
  [3] https://bugs.launchpad.net/tripleo/+bug/1292105
  
  
  
  ___
  OpenStack-dev mailing list
  OpenStack-dev@lists.openstack.org
  http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
  
 
 ___
 OpenStack-dev mailing list
 OpenStack-dev@lists.openstack.org
 http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev



___
OpenStack-dev mailing list
OpenStack-dev@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [TripleO] CI status update for this week

2014-06-20 Thread Robert Collins
Well we probably need some backwards compat glue to keep deploying
supported versions. More on that in the spec I'm drafting.
On 21 Jun 2014 12:26, Dan Prince dpri...@redhat.com wrote:

 On Fri, 2014-06-20 at 16:51 -0400, Charles Crouch wrote:
 
  - Original Message -
   Not a great week for TripleO CI. We had 3 different failures related
 to:
  
Nova [1]: we were using a deprecated config option
Heat [2]: missing heat data obtained from the Heat CFN API
Neutron [3]: a broken GRE overlay network setup
 
  The last two are bugs, but is there anything tripleo can do about
 avoiding the first one in the future?:

 Yes. Reviewing and monitoring our log files would have been helpful
 here. Nova did nothing wrong... we were just plain using an old option
 which was deprecated in Icehouse.

 With TripleO's upstream focus we need to maintain a balancing act and
 try to avoid using new option names until a release has been made. I
 think once the release is made however (Icehouse in this case) we should
 immediately move to drop all deprecated options and use the new
 versions. If we follow a process like this we should be safeguarded
 from this sort of failure in the future.

 Dan

  e.g. reviewing a list of deprecated options and seeing when they will be
 removed.
 
  do the integrated projects have a protocol for when an option is
 deprecated and at what point it can be removed?
  e.g. if I make something deprecated in icehouse I can remove it in juno,
 but if I
  make something deprecated at the start of juno I can't remove it at the
 end of juno?
 
  
   The TripleO check jobs look to be running stable again today so if your
   patch had failures from earlier this week then recheck away (perhaps
   referencing one of these bugs if appropriate). The queue is fairly
 empty
   right now...
  
   Thanks for all the help in tracking these down and getting things
 fixed.
  
   [1] https://bugs.launchpad.net/tripleo/+bug/1292105
 
  I think [1] was meant to be
  https://bugs.launchpad.net/tripleo/+bug/1330735
 
   [2] https://bugs.launchpad.net/heat/+bug/1331720
   [3] https://bugs.launchpad.net/tripleo/+bug/1292105
  
  
  
   ___
   OpenStack-dev mailing list
   OpenStack-dev@lists.openstack.org
   http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
  
 
  ___
  OpenStack-dev mailing list
  OpenStack-dev@lists.openstack.org
  http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev



 ___
 OpenStack-dev mailing list
 OpenStack-dev@lists.openstack.org
 http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev

___
OpenStack-dev mailing list
OpenStack-dev@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


[openstack-dev] [TripleO][CI] reduced capacity, rebuilding hp1 region

2014-05-29 Thread Robert Collins
Hi, the HP1 tripleo test cloud region has been systematically failing
and rather than flogging it along we're going to strip it down and
bring it back up with some of the improvements that have happened over
the last $months, as well as changing the undercloud to deploy via
Ironic and other goodness.

There's plenty to do to help move this along - we'll be spinning up a
list of automation issues and glitches that need fixing here -
https://etherpad.openstack.org/p/tripleo-ci-hp1-rebuild

My goal is to have the entirety of each step automated, so we're not
carrying odd quirks or workarounds.

If you are interested please do jump into #tripleo and chat to myself
or DerekH about how you can help out.

-Rob

-- 
Robert Collins rbtcoll...@hp.com
Distinguished Technologist
HP Converged Cloud

___
OpenStack-dev mailing list
OpenStack-dev@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [TripleO] CI needs YOU

2014-05-27 Thread Robert Collins
Latest outage was due to nodepool having a stuck TCP connection to the
HP1 region again.

I've filed https://bugs.launchpad.net/python-novaclient/+bug/1323862
about it. If someone were to pick this up and run with it, it would be
super useful.

-Rob

On 24 May 2014 05:01, Clint Byrum cl...@fewbar.com wrote:
 I forgot to include a link explaining our cloud:

 https://wiki.openstack.org/wiki/TripleO/TripleOCloud

 Thanks!

 Excerpts from Clint Byrum's message of 2014-05-22 15:24:05 -0700:
 Ahoy there, TripleO interested parties. In the last few months, we've
 gotten a relatively robust, though not nearly complete, CI system for
 TripleO. It is a bit unorthodox, as we have a strong desire to ensure
 PXE booting works, and that requires us running in our own cloud.

 We have this working, in two regions of TripleO deployed clouds which
 we manage ourselves. We've had quite a few issues, mostly hardware
 related, and some related to the fact that TripleO doesn't have HA yet,
 so our CI clouds go down whenever our controllers go down.

 Anyway, Derek Higgins, Dan Prince, Robert Collins, and myself, have been
 doing most of the heavy lifting on this. As a result, CI is not up and
 working all that often. It needs more operational support.

 So, I would encourage anyone interested in TripleO development to start
 working with us to maintain these two cloud regions (hopefully more
 regions will come up soon) so that we can keep CI flowing and expand
 coverage to include even more of TripleO.

 Thank you!

 ___
 OpenStack-dev mailing list
 OpenStack-dev@lists.openstack.org
 http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev



-- 
Robert Collins rbtcoll...@hp.com
Distinguished Technologist
HP Converged Cloud

___
OpenStack-dev mailing list
OpenStack-dev@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [TripleO] CI needs YOU

2014-05-23 Thread Sanchez, Cristian A
Hi Clint, 
Please count me in.

Cristian

On 22/05/14 19:24, Clint Byrum cl...@fewbar.com wrote:

Ahoy there, TripleO interested parties. In the last few months, we've
gotten a relatively robust, though not nearly complete, CI system for
TripleO. It is a bit unorthodox, as we have a strong desire to ensure
PXE booting works, and that requires us running in our own cloud.

We have this working, in two regions of TripleO deployed clouds which
we manage ourselves. We've had quite a few issues, mostly hardware
related, and some related to the fact that TripleO doesn't have HA yet,
so our CI clouds go down whenever our controllers go down.

Anyway, Derek Higgins, Dan Prince, Robert Collins, and myself, have been
doing most of the heavy lifting on this. As a result, CI is not up and
working all that often. It needs more operational support.

So, I would encourage anyone interested in TripleO development to start
working with us to maintain these two cloud regions (hopefully more
regions will come up soon) so that we can keep CI flowing and expand
coverage to include even more of TripleO.

Thank you!

___
OpenStack-dev mailing list
OpenStack-dev@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


___
OpenStack-dev mailing list
OpenStack-dev@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [TripleO] CI needs YOU

2014-05-23 Thread Clint Byrum
I forgot to include a link explaining our cloud:

https://wiki.openstack.org/wiki/TripleO/TripleOCloud

Thanks!

Excerpts from Clint Byrum's message of 2014-05-22 15:24:05 -0700:
 Ahoy there, TripleO interested parties. In the last few months, we've
 gotten a relatively robust, though not nearly complete, CI system for
 TripleO. It is a bit unorthodox, as we have a strong desire to ensure
 PXE booting works, and that requires us running in our own cloud.
 
 We have this working, in two regions of TripleO deployed clouds which
 we manage ourselves. We've had quite a few issues, mostly hardware
 related, and some related to the fact that TripleO doesn't have HA yet,
 so our CI clouds go down whenever our controllers go down.
 
 Anyway, Derek Higgins, Dan Prince, Robert Collins, and myself, have been
 doing most of the heavy lifting on this. As a result, CI is not up and
 working all that often. It needs more operational support.
 
 So, I would encourage anyone interested in TripleO development to start
 working with us to maintain these two cloud regions (hopefully more
 regions will come up soon) so that we can keep CI flowing and expand
 coverage to include even more of TripleO.
 
 Thank you!

___
OpenStack-dev mailing list
OpenStack-dev@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


[openstack-dev] [TripleO] CI needs YOU

2014-05-22 Thread Clint Byrum
Ahoy there, TripleO interested parties. In the last few months, we've
gotten a relatively robust, though not nearly complete, CI system for
TripleO. It is a bit unorthodox, as we have a strong desire to ensure
PXE booting works, and that requires us running in our own cloud.

We have this working, in two regions of TripleO deployed clouds which
we manage ourselves. We've had quite a few issues, mostly hardware
related, and some related to the fact that TripleO doesn't have HA yet,
so our CI clouds go down whenever our controllers go down.

Anyway, Derek Higgins, Dan Prince, Robert Collins, and myself, have been
doing most of the heavy lifting on this. As a result, CI is not up and
working all that often. It needs more operational support.

So, I would encourage anyone interested in TripleO development to start
working with us to maintain these two cloud regions (hopefully more
regions will come up soon) so that we can keep CI flowing and expand
coverage to include even more of TripleO.

Thank you!

___
OpenStack-dev mailing list
OpenStack-dev@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


[openstack-dev] [TripleO][CI] all overcloud jobs failing

2014-03-28 Thread Robert Collins
Swift changed the permissions on the swift ring object file which
broke tripleo deployments of swift. (root:root mode 0600 files are not
readable by the 'swift' user). We've got a patch in flight
(https://review.openstack.org/#/c/83645/) that will fix this, but
until that lands please don't spend a lot of time debugging why your
overcloud tests fail :). (Also please don't land any patch that might
affect the undercloud functionality or overcloud until the fix is
landed).
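
In the meantime, a quick way to spot the problem on a deployed node; this
is just a sketch and assumes the default /etc/swift/*.ring.gz paths and
the 'swift' user:

# Quick check (a sketch using the usual defaults): are the swift ring
# files actually readable by the 'swift' user?  That is exactly what the
# root:root 0600 change broke.
import glob
import grp
import os
import pwd
import stat
import sys

RING_GLOB = '/etc/swift/*.ring.gz'
SWIFT_USER = 'swift'


def readable_by(path, username):
    """Best-effort owner/group/other check, ignoring ACLs."""
    st = os.stat(path)
    user = pwd.getpwnam(username)
    if st.st_uid == user.pw_uid:
        return bool(st.st_mode & stat.S_IRUSR)
    if st.st_gid == user.pw_gid or username in grp.getgrgid(st.st_gid).gr_mem:
        return bool(st.st_mode & stat.S_IRGRP)
    return bool(st.st_mode & stat.S_IROTH)


if __name__ == '__main__':
    bad = [p for p in glob.glob(RING_GLOB) if not readable_by(p, SWIFT_USER)]
    if bad:
        print('Ring files not readable by %s: %s' % (SWIFT_USER, ', '.join(bad)))
        sys.exit(1)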

Btw Swift folk - 'check experimental' runs the tripleo jobs in all
projects, so if you have any concerns about impacting deployments - please
run 'check experimental' before approving things ;)

-Rob

-- 
Robert Collins rbtcoll...@hp.com
Distinguished Technologist
HP Converged Cloud

___
OpenStack-dev mailing list
OpenStack-dev@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [Tripleo][CI] check-tripleo outage

2014-02-26 Thread Derek Higgins
On 25/02/14 00:08, Robert Collins wrote:
 Today we had an outage of the tripleo test cloud :(.
 
 tl;dr:
  - we were down for 14 hours
  - we don't know the fundamental cause
  - infra were not inconvenienced - yaaay
  - its all ok now.
Looks like we've hit the same problem again tonight, I've
o rebooted the server
o fixed up the hostname
o restarted nova and neutron services on the controller

VMs are still not getting IPs; I'm not seeing dhcp requests from them
coming into dnsmasq. I spent some time trying to figure out the problem with
no luck, so I'll pick this up again in a few hours if nobody else has before then.
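
For whoever picks this up, a rough debugging aid (it assumes the usual
qdhcp-<network-id> namespace naming and needs root on the host running
the dhcp agent):

# Debugging aid (a sketch; assumes the usual qdhcp-<network-id> namespace
# naming and needs to run as root on the host carrying the dhcp agent):
# wait up to 60s for one DHCP packet inside the namespace, so we can tell
# whether the VM requests are reaching dnsmasq at all.
import subprocess
import sys


def saw_dhcp_request(network_id, wait=60):
    cmd = ['timeout', str(wait),
           'ip', 'netns', 'exec', 'qdhcp-%s' % network_id,
           'tcpdump', '-lni', 'any', '-c', '1',
           'port', '67', 'or', 'port', '68']
    # tcpdump exits 0 as soon as it captures a matching packet; timeout(1)
    # returns 124 if the window passes without anything being seen.
    return subprocess.call(cmd) == 0


if __name__ == '__main__':
    if not saw_dhcp_request(sys.argv[1]):
        print('No DHCP traffic reached the namespace - look at the tunnel/'
              'OVS flows between the compute hosts and this node.')
        sys.exit(1)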

 
 Read on for more information, what little we have.
 
 We don't know exactly why it happened yet, but the control plane
 dropped off the network. Console showed node still had a correct
 networking configuration, including openflow rules and bridges. The
 node was arpingable, and could arping out, but could not be pinged.
 Tcpdump showed the node sending a ping reply on its raw ethernet
 device, but other machines on the same LAN did not see the packet.
 
 From syslog we can see
 Feb 24 06:28:31 ci-overcloud-notcompute0-gxezgcvv4v2q kernel:
 [1454708.543053] hpsa :06:00.0: cmd_alloc returned NULL!
 events
 
 around the time frame that the drop-off would have happened, but they
 go back many hours before and after that.
 
 After exhausting everything that came to mind we rebooted the machine,
 which promptly spat an NMI trace into the console:
 
 [1502354.552431]  [810fdf98] 
 rcu_eqs_enter_common.isra.43+0x208/0x220
 [1502354.552491]  [810ff9ed] rcu_irq_exit+0x5d/0x90
 [1502354.552549]  [81067670] irq_exit+0x80/0xc0
 [1502354.552605]  [816f9605] smp_apic_timer_interrupt+0x45/0x60
 [1502354.552665]  [816f7f9d] apic_timer_interrupt+0x6d/0x80
 [1502354.552722]  EOI  NMI  [816e1384] ? panic+0x193/0x1d7
 [1502354.552880]  [a02d18e5] hpwdt_pretimeout+0xe5/0xe5 [hpwdt]
 [1502354.552939]  [816efc88] nmi_handle.isra.3+0x88/0x180
 [1502354.552997]  [816eff11] do_nmi+0x191/0x330
 [1502354.553053]  [816ef201] end_repeat_nmi+0x1e/0x2e
 [1502354.553111]  [813d46c2] ? intel_idle+0xc2/0x120
 [1502354.553168]  [813d46c2] ? intel_idle+0xc2/0x120
 [1502354.553226]  [813d46c2] ? intel_idle+0xc2/0x120
 [1502354.553282]  EOE  [8159fe90] cpuidle_enter_state+0x40/0xc0
 [1502354.553408]  [8159ffd9] cpuidle_idle_call+0xc9/0x210
 [1502354.553466]  [8101bafe] arch_cpu_idle+0xe/0x30
 [1502354.553523]  [810b54c5] cpu_startup_entry+0xe5/0x280
 [1502354.553581]  [816d64b7] rest_init+0x77/0x80
 [1502354.553638]  [81d26ef7] start_kernel+0x40a/0x416
 [1502354.553695]  [81d268f6] ? repair_env_string+0x5c/0x5c
 [1502354.553753]  [81d26120] ? early_idt_handlers+0x120/0x120
 [1502354.553812]  [81d265de] x86_64_start_reservations+0x2a/0x2c
 [1502354.553871]  [81d266e8] x86_64_start_kernel+0x108/0x117
 [1502354.553929] ---[ end trace 166b62e89aa1f54b ]---
 
 'yay'. After that, a power reset in the console, it came up ok, just
 needed a minor nudge to refresh its heat configuration and we were up
 and running again.
 
 For some reason, neutron decided to rename its agents at this point
 and we had to remove and reattach the l3 agent before VM connectivity
 was restored.
 https://bugs.launchpad.net/tripleo/+bug/1284354
 
 However, about 90 nodepool nodes were stuck in states like ACTIVE
 deleting, and did not clear until we did a rolling restart of every
 nova compute process.
 https://bugs.launchpad.net/tripleo/+bug/1284356
 
 Cheers,
 Rob
 


___
OpenStack-dev mailing list
OpenStack-dev@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [Tripleo][CI] check-tripleo outage

2014-02-26 Thread Robert Collins
Looking into it now.
On 27 Feb 2014 15:56, Derek Higgins der...@redhat.com wrote:

 On 25/02/14 00:08, Robert Collins wrote:
  Today we had an outage of the tripleo test cloud :(.
 
  tl;dr:
   - we were down for 14 hours
   - we don't know the fundamental cause
   - infra were not inconvenienced - yaaay
   - its all ok now.
 Looks like we've hit the same problem again tonight, I've
 o rebooted the server
 o fixed up the hostname
 o restarted nova and neutron services on the controller

 VMs are still not getting IPs; I'm not seeing dhcp requests from them
 coming into dnsmasq. I spent some time trying to figure out the problem with
 no luck, so I'll pick this up again in a few hours if nobody else has before
 then.

 
  Read on for more information, what little we have.
 
  We don't know exactly why it happened yet, but the control plane
  dropped off the network. Console showed node still had a correct
  networking configuration, including openflow rules and bridges. The
  node was arpingable, and could arping out, but could not be pinged.
  Tcpdump showed the node sending a ping reply on its raw ethernet
  device, but other machines on the same LAN did not see the packet.
 
  From syslog we can see
  Feb 24 06:28:31 ci-overcloud-notcompute0-gxezgcvv4v2q kernel:
  [1454708.543053] hpsa :06:00.0: cmd_alloc returned NULL!
  events
 
  around the time frame that the drop-off would have happened, but they
  go back many hours before and after that.
 
  After exhausting everything that came to mind we rebooted the machine,
  which promptly spat an NMI trace into the console:
 
  [1502354.552431]  [810fdf98]
 rcu_eqs_enter_common.isra.43+0x208/0x220
  [1502354.552491]  [810ff9ed] rcu_irq_exit+0x5d/0x90
  [1502354.552549]  [81067670] irq_exit+0x80/0xc0
  [1502354.552605]  [816f9605] smp_apic_timer_interrupt+0x45/0x60
  [1502354.552665]  [816f7f9d] apic_timer_interrupt+0x6d/0x80
  [1502354.552722]  EOI  NMI  [816e1384] ? panic+0x193/0x1d7
  [1502354.552880]  [a02d18e5] hpwdt_pretimeout+0xe5/0xe5 [hpwdt]
  [1502354.552939]  [816efc88] nmi_handle.isra.3+0x88/0x180
  [1502354.552997]  [816eff11] do_nmi+0x191/0x330
  [1502354.553053]  [816ef201] end_repeat_nmi+0x1e/0x2e
  [1502354.553111]  [813d46c2] ? intel_idle+0xc2/0x120
  [1502354.553168]  [813d46c2] ? intel_idle+0xc2/0x120
  [1502354.553226]  [813d46c2] ? intel_idle+0xc2/0x120
  [1502354.553282]  EOE  [8159fe90]
 cpuidle_enter_state+0x40/0xc0
  [1502354.553408]  [8159ffd9] cpuidle_idle_call+0xc9/0x210
  [1502354.553466]  [8101bafe] arch_cpu_idle+0xe/0x30
  [1502354.553523]  [810b54c5] cpu_startup_entry+0xe5/0x280
  [1502354.553581]  [816d64b7] rest_init+0x77/0x80
  [1502354.553638]  [81d26ef7] start_kernel+0x40a/0x416
  [1502354.553695]  [81d268f6] ? repair_env_string+0x5c/0x5c
  [1502354.553753]  [81d26120] ? early_idt_handlers+0x120/0x120
  [1502354.553812]  [81d265de]
 x86_64_start_reservations+0x2a/0x2c
  [1502354.553871]  [81d266e8] x86_64_start_kernel+0x108/0x117
  [1502354.553929] ---[ end trace 166b62e89aa1f54b ]---
 
  'yay'. After that, a power reset in the console, it came up ok, just
  needed a minor nudge to refresh its heat configuration and we were up
  and running again.
 
  For some reason, neutron decided to rename its agents at this point
  and we had to remove and reattach the l3 agent before VM connectivity
  was restored.
  https://bugs.launchpad.net/tripleo/+bug/1284354
 
  However, about 90 nodepool nodes were stuck in states like ACTIVE
  deleting, and did not clear until we did a rolling restart of every
  nova compute process.
  https://bugs.launchpad.net/tripleo/+bug/1284356
 
  Cheers,
  Rob
 


 ___
 OpenStack-dev mailing list
 OpenStack-dev@lists.openstack.org
 http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev

___
OpenStack-dev mailing list
OpenStack-dev@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [Tripleo][CI] check-tripleo outage

2014-02-26 Thread Robert Collins
On 27 February 2014 15:55, Derek Higgins der...@redhat.com wrote:
 On 25/02/14 00:08, Robert Collins wrote:
 Today we had an outage of the tripleo test cloud :(.

 tl;dr:
  - we were down for 14 hours
  - we don't know the fundamental cause
  - infra were not inconvenienced - yaaay
  - its all ok now.
 Looks like we've hit the same problem again tonight, I've
 o rebooted the server
 o fixed up the hostname
 o restarted nova and neutron services on the controller

 VMs are still not getting IPs; I'm not seeing dhcp requests from them
 coming into dnsmasq. I spent some time trying to figure out the problem with
 no luck, so I'll pick this up again in a few hours if nobody else has before then.

https://bugs.launchpad.net/tripleo/+bug/1284354 - Neutron has decided
to toggle short and long name again:
| id                                   | agent_type         | host                                            | alive | admin_state_up |
| 0d00f56e-ca18-48c9-9552-4a5aad8f507d | L3 agent           | ci-overcloud-notcompute0-gxezgcvv4v2q.novalocal | xxx   | True           |
| 2c68c810-7cc5-4547-a106-ddb698ba8245 | DHCP agent         | ci-overcloud-notcompute0-gxezgcvv4v2q           | :-)   | True           |
| 4887d74a-6394-4144-a994-104e092e956e | DHCP agent         | ci-overcloud-notcompute0-gxezgcvv4v2q.novalocal | xxx   | True           |
| 8d405cc8-c5c8-429a-ae87-a67296d4249e | Open vSwitch agent | ci-overcloud-notcompute0-gxezgcvv4v2q           | :-)   | True           |
| b269d9ea-2435-4cd4-a227-42ee93ab9b62 | Metadata agent     | ci-overcloud-notcompute0-gxezgcvv4v2q.novalocal | xxx   | True           |
| c7b80d9b-f3e8-4f15-9437-6c5b78bb6af2 | Open vSwitch agent | ci-overcloud-notcompute0-gxezgcvv4v2q.novalocal | xxx   | True           |
| dd9f7688-8135-4098-8eb2-23c9f9c6144c | Metadata agent     | ci-overcloud-notcompute0-gxezgcvv4v2q           | :-)   | True           |
| e37a5df6-47a1-45a0-bf69-e0d1f0fe9553 | L3 agent           | ci-overcloud-notcompute0-gxezgcvv4v2q           | :-)   | True           |

I've moved the l3 router from the dead agent to the live agent and can
ssh into the broker.
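
For the record, the manual fix amounts to roughly the following if
scripted with python-neutronclient (method names are from memory of the
client at the time, so treat this as a sketch rather than gospel):

# Roughly what the manual fix amounts to, using python-neutronclient as it
# existed around this time (method names from memory - treat as a sketch):
# move every router hosted on a dead L3 agent onto the first live one.
import os

from neutronclient.v2_0 import client


def rehome_routers(neutron):
    agents = neutron.list_agents(agent_type='L3 agent')['agents']
    live = [a for a in agents if a['alive']]
    dead = [a for a in agents if not a['alive']]
    if not live:
        raise RuntimeError('no live L3 agents to move routers onto')
    target = live[0]
    for agent in dead:
        for router in neutron.list_routers_on_l3_agent(agent['id'])['routers']:
            neutron.remove_router_from_l3_agent(agent['id'], router['id'])
            neutron.add_router_to_l3_agent(target['id'],
                                           {'router_id': router['id']})
            print('moved router %s: %s -> %s'
                  % (router['id'], agent['host'], target['host']))


if __name__ == '__main__':
    rehome_routers(client.Client(username=os.environ['OS_USERNAME'],
                                 password=os.environ['OS_PASSWORD'],
                                 tenant_name=os.environ['OS_TENANT_NAME'],
                                 auth_url=os.environ['OS_AUTH_URL']))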

Checking new instance connectivity next.

-Rob

-- 
Robert Collins rbtcoll...@hp.com
Distinguished Technologist
HP Converged Cloud

___
OpenStack-dev mailing list
OpenStack-dev@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [Tripleo][CI] check-tripleo outage

2014-02-26 Thread Robert Collins
On 27 February 2014 20:35, Robert Collins robe...@robertcollins.net wrote:

 Checking new instance connectivity next.

DHCP is functional and no cloud-init errors, so we should be fully up.

-Rob


Robert Collins rbtcoll...@hp.com
Distinguished Technologist
HP Converged Cloud

___
OpenStack-dev mailing list
OpenStack-dev@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


[openstack-dev] [Tripleo][CI] check-tripleo outage

2014-02-24 Thread Robert Collins
Today we had an outage of the tripleo test cloud :(.

tl;dr:
 - we were down for 14 hours
 - we don't know the fundamental cause
 - infra were not inconvenienced - yaaay
 - its all ok now.

Read on for more information, what little we have.

We don't know exactly why it happened yet, but the control plane
dropped off the network. Console showed node still had a correct
networking configuration, including openflow rules and bridges. The
node was arpingable, and could arping out, but could not be pinged.
Tcpdump showed the node sending a ping reply on its raw ethernet
device, but other machines on the same LAN did not see the packet.

From syslog we can see
Feb 24 06:28:31 ci-overcloud-notcompute0-gxezgcvv4v2q kernel:
[1454708.543053] hpsa :06:00.0: cmd_alloc returned NULL!
events

around the time frame that the drop-off would have happened, but they
go back many hours before and after that.

After exhausting everything that came to mind we rebooted the machine,
which promptly spat an NMI trace into the console:

[1502354.552431]  [810fdf98] rcu_eqs_enter_common.isra.43+0x208/0x220
[1502354.552491]  [810ff9ed] rcu_irq_exit+0x5d/0x90
[1502354.552549]  [81067670] irq_exit+0x80/0xc0
[1502354.552605]  [816f9605] smp_apic_timer_interrupt+0x45/0x60
[1502354.552665]  [816f7f9d] apic_timer_interrupt+0x6d/0x80
[1502354.552722]  EOI  NMI  [816e1384] ? panic+0x193/0x1d7
[1502354.552880]  [a02d18e5] hpwdt_pretimeout+0xe5/0xe5 [hpwdt]
[1502354.552939]  [816efc88] nmi_handle.isra.3+0x88/0x180
[1502354.552997]  [816eff11] do_nmi+0x191/0x330
[1502354.553053]  [816ef201] end_repeat_nmi+0x1e/0x2e
[1502354.553111]  [813d46c2] ? intel_idle+0xc2/0x120
[1502354.553168]  [813d46c2] ? intel_idle+0xc2/0x120
[1502354.553226]  [813d46c2] ? intel_idle+0xc2/0x120
[1502354.553282]  EOE  [8159fe90] cpuidle_enter_state+0x40/0xc0
[1502354.553408]  [8159ffd9] cpuidle_idle_call+0xc9/0x210
[1502354.553466]  [8101bafe] arch_cpu_idle+0xe/0x30
[1502354.553523]  [810b54c5] cpu_startup_entry+0xe5/0x280
[1502354.553581]  [816d64b7] rest_init+0x77/0x80
[1502354.553638]  [81d26ef7] start_kernel+0x40a/0x416
[1502354.553695]  [81d268f6] ? repair_env_string+0x5c/0x5c
[1502354.553753]  [81d26120] ? early_idt_handlers+0x120/0x120
[1502354.553812]  [81d265de] x86_64_start_reservations+0x2a/0x2c
[1502354.553871]  [81d266e8] x86_64_start_kernel+0x108/0x117
[1502354.553929] ---[ end trace 166b62e89aa1f54b ]---

'yay'. After that, a power reset in the console, it came up ok, just
needed a minor nudge to refresh its heat configuration and we were up
and running again.

For some reason, neutron decided to rename its agents at this point
and we had to remove and reattach the l3 agent before VM connectivity
was restored.
https://bugs.launchpad.net/tripleo/+bug/1284354

However, about 90 nodepool nodes were stuck in states like ACTIVE
deleting, and did not clear until we did a rolling restart of every
nova compute process.
https://bugs.launchpad.net/tripleo/+bug/1284356

Cheers,
Rob

-- 
Robert Collins rbtcoll...@hp.com
Distinguished Technologist
HP Converged Cloud

___
OpenStack-dev mailing list
OpenStack-dev@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev

