Re: [openstack-dev] [devstack] [Cinder-GlusterFS CI] centos7 gate job abrupt failures

2015-03-05 Thread Deepak Shetty
Update:

   The Cinder-GlusterFS CI job (Ubuntu-based) was added as experimental
(non-voting) to the cinder project [1].
It's running successfully without any issues so far [2], [3].

We will monitor it for a few days, and if it continues to run fine, we will
propose a patch to make it a check (voting) job.

[1]: https://review.openstack.org/160664
[2]: https://jenkins07.openstack.org/job/gate-tempest-dsvm-full-glusterfs/
[3]: https://jenkins02.openstack.org/job/gate-tempest-dsvm-full-glusterfs/
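
For anyone wanting to keep an eye on the job in the meantime, the Jenkins
JSON API gives a quick way to list recent results; a small sketch using the
URL from [2] above (endpoint shape assumed, adjust as needed):

  # list recent builds of the job and whether they passed
  curl -s 'https://jenkins07.openstack.org/job/gate-tempest-dsvm-full-glusterfs/api/json?tree=builds[number,result]' \
    | python -m json.tool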

thanx,
deepak

On Fri, Feb 27, 2015 at 10:47 PM, Deepak Shetty  wrote:

>
>
> On Fri, Feb 27, 2015 at 4:02 PM, Deepak Shetty 
> wrote:
>
>>
>>
>> On Wed, Feb 25, 2015 at 11:48 PM, Deepak Shetty 
>> wrote:
>>
>>>
>>>
>>> On Wed, Feb 25, 2015 at 8:42 PM, Deepak Shetty 
>>> wrote:
>>>


 On Wed, Feb 25, 2015 at 6:34 PM, Jeremy Stanley 
 wrote:

> On 2015-02-25 17:02:34 +0530 (+0530), Deepak Shetty wrote:
> [...]
> > Run 2) We removed glusterfs backend, so Cinder was configured with
> > the default storage backend i.e. LVM. We re-created the OOM here
> > too
> >
> > So that proves that glusterfs doesn't cause it, as its happening
> > without glusterfs too.
>
> Well, if you re-ran the job on the same VM then the second result is
> potentially contaminated. Luckily this hypothesis can be confirmed
> by running the second test on a fresh VM in Rackspace.
>

 Maybe true, but we did the same on hpcloud provider VM too and both time
 it ran successfully with glusterfs as the cinder backend. Also before
 starting
 the 2nd run, we did unstack and saw that free memory did go back to 5G+
 and then re-invoked your script, I believe the contamination could
 result in some
 additional testcase failures (which we did see) but shouldn't be
 related to
 whether system can OOM or not, since thats a runtime thing.

 I see that the VM is up again. We will execute the 2nd run afresh now
 and update
 here.

>>>
>>> Ran tempest with configured with default backend i.e. LVM and was able
>>> to recreate
>>> the OOM issue, so running tempest without gluster against a fresh VM
>>> reliably
>>> recreates the OOM issue, snip below from syslog.
>>>
>>> Feb 25 16:58:37 devstack-centos7-rax-dfw-979654 kernel: glance-api
>>> invoked oom-killer: gfp_mask=0x201da, order=0, oom_score_adj=0
>>>
>>> Had a discussion with clarkb on IRC and given that F20 is discontinued,
>>> F21 has issues with tempest (under debug by ianw)
>>> and centos7 also has issues on rax (as evident from this thread), the
>>> only option left is to go with ubuntu based CI job, which
>>> BharatK is working on now.
>>>
>>
>> Quick Update:
>>
>> Cinder-GlusterFS CI job on ubuntu was added (
>> https://review.openstack.org/159217)
>>
>> We ran it 3 times against our stackforge repo patch @
>> https://review.openstack.org/159711
>> and it works fine (2 testcase failures, which are expected and we're
>> working towards fixing them)
>>
>> For the logs of the 3 experimental runs, look @
>>
>> http://logs.openstack.org/11/159711/1/experimental/gate-tempest-dsvm-full-glusterfs/
>>
>> Of the 3 jobs, 1 was schedued on rax and 2 on hpcloud, so its working
>> nicely across
>> the different cloud providers.
>>
>
> Clarkb, Fungi,
>   Given that the ubuntu job is stable, I would like to propose to add it
> as experimental to the
> openstack cinder while we work on fixing the 2 failed test cases in
> parallel
>
> thanx,
> deepak
>
>
__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [devstack] [Cinder-GlusterFS CI] centos7 gate job abrupt failures

2015-02-27 Thread Deepak Shetty
On Fri, Feb 27, 2015 at 4:02 PM, Deepak Shetty  wrote:

>
>
> On Wed, Feb 25, 2015 at 11:48 PM, Deepak Shetty 
> wrote:
>
>>
>>
>> On Wed, Feb 25, 2015 at 8:42 PM, Deepak Shetty 
>> wrote:
>>
>>>
>>>
>>> On Wed, Feb 25, 2015 at 6:34 PM, Jeremy Stanley 
>>> wrote:
>>>
 On 2015-02-25 17:02:34 +0530 (+0530), Deepak Shetty wrote:
 [...]
 > Run 2) We removed glusterfs backend, so Cinder was configured with
 > the default storage backend i.e. LVM. We re-created the OOM here
 > too
 >
 > So that proves that glusterfs doesn't cause it, as its happening
 > without glusterfs too.

 Well, if you re-ran the job on the same VM then the second result is
 potentially contaminated. Luckily this hypothesis can be confirmed
 by running the second test on a fresh VM in Rackspace.

>>>
>>> Maybe true, but we did the same on hpcloud provider VM too and both time
>>> it ran successfully with glusterfs as the cinder backend. Also before
>>> starting
>>> the 2nd run, we did unstack and saw that free memory did go back to 5G+
>>> and then re-invoked your script, I believe the contamination could
>>> result in some
>>> additional testcase failures (which we did see) but shouldn't be related
>>> to
>>> whether system can OOM or not, since thats a runtime thing.
>>>
>>> I see that the VM is up again. We will execute the 2nd run afresh now
>>> and update
>>> here.
>>>
>>
>> Ran tempest with configured with default backend i.e. LVM and was able to
>> recreate
>> the OOM issue, so running tempest without gluster against a fresh VM
>> reliably
>> recreates the OOM issue, snip below from syslog.
>>
>> Feb 25 16:58:37 devstack-centos7-rax-dfw-979654 kernel: glance-api
>> invoked oom-killer: gfp_mask=0x201da, order=0, oom_score_adj=0
>>
>> Had a discussion with clarkb on IRC and given that F20 is discontinued,
>> F21 has issues with tempest (under debug by ianw)
>> and centos7 also has issues on rax (as evident from this thread), the
>> only option left is to go with ubuntu based CI job, which
>> BharatK is working on now.
>>
>
> Quick Update:
>
> Cinder-GlusterFS CI job on ubuntu was added (
> https://review.openstack.org/159217)
>
> We ran it 3 times against our stackforge repo patch @
> https://review.openstack.org/159711
> and it works fine (2 testcase failures, which are expected and we're
> working towards fixing them)
>
> For the logs of the 3 experimental runs, look @
>
> http://logs.openstack.org/11/159711/1/experimental/gate-tempest-dsvm-full-glusterfs/
>
> Of the 3 jobs, 1 was schedued on rax and 2 on hpcloud, so its working
> nicely across
> the different cloud providers.
>

Clarkb, Fungi,
  Given that the Ubuntu job is stable, I would like to propose adding it as
an experimental job to the openstack/cinder project while we work on fixing
the 2 failed test cases in parallel.

thanx,
deepak
__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [devstack] [Cinder-GlusterFS CI] centos7 gate job abrupt failures

2015-02-27 Thread Deepak Shetty
On Wed, Feb 25, 2015 at 11:48 PM, Deepak Shetty  wrote:

>
>
> On Wed, Feb 25, 2015 at 8:42 PM, Deepak Shetty 
> wrote:
>
>>
>>
>> On Wed, Feb 25, 2015 at 6:34 PM, Jeremy Stanley 
>> wrote:
>>
>>> On 2015-02-25 17:02:34 +0530 (+0530), Deepak Shetty wrote:
>>> [...]
>>> > Run 2) We removed glusterfs backend, so Cinder was configured with
>>> > the default storage backend i.e. LVM. We re-created the OOM here
>>> > too
>>> >
>>> > So that proves that glusterfs doesn't cause it, as its happening
>>> > without glusterfs too.
>>>
>>> Well, if you re-ran the job on the same VM then the second result is
>>> potentially contaminated. Luckily this hypothesis can be confirmed
>>> by running the second test on a fresh VM in Rackspace.
>>>
>>
>> Maybe true, but we did the same on hpcloud provider VM too and both time
>> it ran successfully with glusterfs as the cinder backend. Also before
>> starting
>> the 2nd run, we did unstack and saw that free memory did go back to 5G+
>> and then re-invoked your script, I believe the contamination could result
>> in some
>> additional testcase failures (which we did see) but shouldn't be related
>> to
>> whether system can OOM or not, since thats a runtime thing.
>>
>> I see that the VM is up again. We will execute the 2nd run afresh now and
>> update
>> here.
>>
>
> Ran tempest with configured with default backend i.e. LVM and was able to
> recreate
> the OOM issue, so running tempest without gluster against a fresh VM
> reliably
> recreates the OOM issue, snip below from syslog.
>
> Feb 25 16:58:37 devstack-centos7-rax-dfw-979654 kernel: glance-api invoked
> oom-killer: gfp_mask=0x201da, order=0, oom_score_adj=0
>
> Had a discussion with clarkb on IRC and given that F20 is discontinued,
> F21 has issues with tempest (under debug by ianw)
> and centos7 also has issues on rax (as evident from this thread), the only
> option left is to go with ubuntu based CI job, which
> BharatK is working on now.
>

Quick Update:

Cinder-GlusterFS CI job on ubuntu was added (
https://review.openstack.org/159217)

We ran it 3 times against our stackforge repo patch @
https://review.openstack.org/159711
and it works fine (2 testcase failures, which are expected and we're
working towards fixing them)

For the logs of the 3 experimental runs, look @
http://logs.openstack.org/11/159711/1/experimental/gate-tempest-dsvm-full-glusterfs/

Of the 3 jobs, 1 was scheduled on rax and 2 on hpcloud, so it's working
nicely across the different cloud providers.

thanx,
deepak
__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [devstack] [Cinder-GlusterFS CI] centos7 gate job abrupt failures

2015-02-26 Thread Clark Boylan


On Thu, Feb 26, 2015, at 03:03 AM, Deepak Shetty wrote:
> On Wed, Feb 25, 2015 at 6:11 AM, Jeremy Stanley 
> wrote:
> 
> > On 2015-02-25 01:02:07 +0530 (+0530), Bharat Kumar wrote:
> > [...]
> > > After running 971 test cases VM inaccessible for 569 ticks
> > [...]
> >
> > Glad you're able to reproduce it. For the record that is running
> > their 8GB performance flavor with a CentOS 7 PVHVM base image. The
> > steps to recreate are http://paste.openstack.org/show/181303/ as
> > discussed in IRC (for the sake of others following along). I've held
> > a similar worker in HPCloud (15.126.235.20) which is a 30GB flavor
> > artifically limited to 8GB through a kernel boot parameter.
> > Hopefully following the same steps there will help either confirm
> > the issue isn't specific to running in one particular service
> > provider, or will yield some useful difference which could help
> > highlight the cause.
> >
> > Either way, once 104.239.136.99 and 15.126.235.20 are no longer
> > needed, please let one of the infrastructure root admins know to
> > delete them.
> >
> 
> You can delete these VMs, wil request if needed again
I have marked these VMs for deletion and they should be gone shortly. The new
experimental job is in place, so you can start testing it against your
plugin with `check experimental` comments.
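
(For anyone following along: the experimental pipeline is triggered by leaving
a Gerrit review comment whose body is just "check experimental". That can be
done from the web UI, or via the Gerrit ssh CLI; a hedged sketch with
placeholder values in angle brackets:)

  # post the trigger comment on an open review
  ssh -p 29418 <username>@review.openstack.org \
    gerrit review <change>,<patchset> --message '"check experimental"'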

Clark

__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [devstack] [Cinder-GlusterFS CI] centos7 gate job abrupt failures

2015-02-26 Thread Deepak Shetty
On Wed, Feb 25, 2015 at 6:11 AM, Jeremy Stanley  wrote:

> On 2015-02-25 01:02:07 +0530 (+0530), Bharat Kumar wrote:
> [...]
> > After running 971 test cases VM inaccessible for 569 ticks
> [...]
>
> Glad you're able to reproduce it. For the record that is running
> their 8GB performance flavor with a CentOS 7 PVHVM base image. The
> steps to recreate are http://paste.openstack.org/show/181303/ as
> discussed in IRC (for the sake of others following along). I've held
> a similar worker in HPCloud (15.126.235.20) which is a 30GB flavor
> artifically limited to 8GB through a kernel boot parameter.
> Hopefully following the same steps there will help either confirm
> the issue isn't specific to running in one particular service
> provider, or will yield some useful difference which could help
> highlight the cause.
>
> Either way, once 104.239.136.99 and 15.126.235.20 are no longer
> needed, please let one of the infrastructure root admins know to
> delete them.
>

You can delete these VMs; will request again if needed.

thanx,
deepak
__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [devstack] [Cinder-GlusterFS CI] centos7 gate job abrupt failures

2015-02-25 Thread Deepak Shetty
On Wed, Feb 25, 2015 at 8:42 PM, Deepak Shetty  wrote:

>
>
> On Wed, Feb 25, 2015 at 6:34 PM, Jeremy Stanley  wrote:
>
>> On 2015-02-25 17:02:34 +0530 (+0530), Deepak Shetty wrote:
>> [...]
>> > Run 2) We removed glusterfs backend, so Cinder was configured with
>> > the default storage backend i.e. LVM. We re-created the OOM here
>> > too
>> >
>> > So that proves that glusterfs doesn't cause it, as its happening
>> > without glusterfs too.
>>
>> Well, if you re-ran the job on the same VM then the second result is
>> potentially contaminated. Luckily this hypothesis can be confirmed
>> by running the second test on a fresh VM in Rackspace.
>>
>
> Maybe true, but we did the same on hpcloud provider VM too and both time
> it ran successfully with glusterfs as the cinder backend. Also before
> starting
> the 2nd run, we did unstack and saw that free memory did go back to 5G+
> and then re-invoked your script, I believe the contamination could result
> in some
> additional testcase failures (which we did see) but shouldn't be related to
> whether system can OOM or not, since thats a runtime thing.
>
> I see that the VM is up again. We will execute the 2nd run afresh now and
> update
> here.
>

Ran tempest configured with the default backend, i.e. LVM, and was able to
recreate the OOM issue. So running tempest without gluster against a fresh VM
reliably recreates the OOM issue; snip below from syslog.

Feb 25 16:58:37 devstack-centos7-rax-dfw-979654 kernel: glance-api invoked
oom-killer: gfp_mask=0x201da, order=0, oom_score_adj=0

Had a discussion with clarkb on IRC, and given that F20 is discontinued, F21
has issues with tempest (under debug by ianw) and centos7 also has issues on
rax (as evident from this thread), the only option left is to go with an
Ubuntu-based CI job, which BharatK is working on now.

thanx,
deepak


>
>
>>
>> > The VM (104.239.136.99) is now in such a bad shape that existing
>> > ssh sessions are no longer responding for a long long time now,
>> > tho' ping works. So need someone to help reboot/restart the VM so
>> > that we can collect the logs for records. Couldn't find anyone
>> > during apac TZ to get it reboot.
>> [...]
>>
>> According to novaclient that instance was in a "shutoff" state, and
>> so I had to nova reboot --hard to get it running. Looks like it's
>> back up and reachable again now.
>>
>
> Cool, thanks!
>
>
>>
>> > So from the above we can conclude that the tests are running fine
>> > on hpcloud and not on rax provider. Since the OS (centos7) inside
>> > the VM across provider is same, this now boils down to some issue
>> > with rax provider VM + centos7 combination.
>>
>> This certainly seems possible.
>>
>> > Another data point I could gather is:
>> > The only other centos7 job we have is
>> > check-tempest-dsvm-centos7 and it does not run full tempest
>> > looking at the job's config it only runs smoke tests (also
>> > confirmed the same with Ian W) which i believe is a subset of
>> > tests only.
>>
>> Correct, so if we confirm that we can't successfully run tempest
>> full on CentOS 7 in both of our providers yet, we should probably
>> think hard about the implications on yesterday's discussion as to
>> whether to set the smoke version gating on devstack and
>> devstack-gate changes.
>>
>> > So that brings to the conclusion that probably cinder-glusterfs CI
>> > job (check-tempest-dsvm-full-glusterfs-centos7) is the first
>> > centos7 based job running full tempest tests in upstream CI and
>> > hence is the first to hit the issue, but on rax provider only
>>
>> Entirely likely. As I mentioned last week, we don't yet have any
>> voting/gating jobs running on the platform as far as I can tell, so
>> it's still very much in an experimental stage.
>>
>
> So is there a way for a job to ask for hpcloud affinity, since thats where
> our
> job ran well (faster and only 2 failures, which were expected) ? I am not
> sure
> how easy and time consuming it would be to root cause why centos7 + rax
> provider
> is causing oom.
>
> Alternatively do you recommend using some other OS as the base for our job
> F20 or F21 or ubuntu ? I assume that there are other Jobs in rax provider
> that
> run on Fedora or Ubuntu with full tempest and don't OOM, would you know ?
>
> thanx,
> deepak
>
>
>
__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [devstack] [Cinder-GlusterFS CI] centos7 gate job abrupt failures

2015-02-25 Thread Deepak Shetty
On Wed, Feb 25, 2015 at 6:34 PM, Jeremy Stanley  wrote:

> On 2015-02-25 17:02:34 +0530 (+0530), Deepak Shetty wrote:
> [...]
> > Run 2) We removed glusterfs backend, so Cinder was configured with
> > the default storage backend i.e. LVM. We re-created the OOM here
> > too
> >
> > So that proves that glusterfs doesn't cause it, as its happening
> > without glusterfs too.
>
> Well, if you re-ran the job on the same VM then the second result is
> potentially contaminated. Luckily this hypothesis can be confirmed
> by running the second test on a fresh VM in Rackspace.
>

Maybe true, but we did the same on the hpcloud provider VM too, and both
times it ran successfully with glusterfs as the cinder backend. Also, before
starting the 2nd run, we did unstack and saw that free memory went back to
5G+, and then re-invoked your script. I believe the contamination could
result in some additional testcase failures (which we did see) but shouldn't
be related to whether the system can OOM or not, since that's a runtime
thing.

I see that the VM is up again. We will execute the 2nd run afresh now and
update here.
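
(For clarity, "did unstack" above means roughly the following, assuming a
standard devstack checkout under /opt/stack/devstack; a sketch, not the exact
commands captured from the VM:)

  cd /opt/stack/devstack && ./unstack.sh   # tear down the devstack services
  free -m                                  # confirm free memory is back to ~5G+ before re-running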


>
> > The VM (104.239.136.99) is now in such a bad shape that existing
> > ssh sessions are no longer responding for a long long time now,
> > tho' ping works. So need someone to help reboot/restart the VM so
> > that we can collect the logs for records. Couldn't find anyone
> > during apac TZ to get it reboot.
> [...]
>
> According to novaclient that instance was in a "shutoff" state, and
> so I had to nova reboot --hard to get it running. Looks like it's
> back up and reachable again now.
>

Cool, thanks!


>
> > So from the above we can conclude that the tests are running fine
> > on hpcloud and not on rax provider. Since the OS (centos7) inside
> > the VM across provider is same, this now boils down to some issue
> > with rax provider VM + centos7 combination.
>
> This certainly seems possible.
>
> > Another data point I could gather is:
> > The only other centos7 job we have is
> > check-tempest-dsvm-centos7 and it does not run full tempest
> > looking at the job's config it only runs smoke tests (also
> > confirmed the same with Ian W) which i believe is a subset of
> > tests only.
>
> Correct, so if we confirm that we can't successfully run tempest
> full on CentOS 7 in both of our providers yet, we should probably
> think hard about the implications on yesterday's discussion as to
> whether to set the smoke version gating on devstack and
> devstack-gate changes.
>
> > So that brings to the conclusion that probably cinder-glusterfs CI
> > job (check-tempest-dsvm-full-glusterfs-centos7) is the first
> > centos7 based job running full tempest tests in upstream CI and
> > hence is the first to hit the issue, but on rax provider only
>
> Entirely likely. As I mentioned last week, we don't yet have any
> voting/gating jobs running on the platform as far as I can tell, so
> it's still very much in an experimental stage.
>

So is there a way for a job to ask for hpcloud affinity, since that's where
our job ran well (faster, and only 2 failures, which were expected)? I am not
sure how easy and time-consuming it would be to root-cause why the centos7 +
rax provider combination is causing OOM.

Alternatively, do you recommend using some other OS as the base for our job:
F20, F21 or Ubuntu? I assume there are other jobs on the rax provider that
run on Fedora or Ubuntu with full tempest and don't OOM; would you know?

thanx,
deepak
__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [devstack] [Cinder-GlusterFS CI] centos7 gate job abrupt failures

2015-02-25 Thread Jeremy Stanley
On 2015-02-25 17:02:34 +0530 (+0530), Deepak Shetty wrote:
[...]
> Run 2) We removed glusterfs backend, so Cinder was configured with
> the default storage backend i.e. LVM. We re-created the OOM here
> too
> 
> So that proves that glusterfs doesn't cause it, as its happening
> without glusterfs too.

Well, if you re-ran the job on the same VM then the second result is
potentially contaminated. Luckily this hypothesis can be confirmed
by running the second test on a fresh VM in Rackspace.

> The VM (104.239.136.99) is now in such a bad shape that existing
> ssh sessions are no longer responding for a long long time now,
> tho' ping works. So need someone to help reboot/restart the VM so
> that we can collect the logs for records. Couldn't find anyone
> during apac TZ to get it reboot.
[...]

According to novaclient that instance was in a "shutoff" state, and
so I had to nova reboot --hard to get it running. Looks like it's
back up and reachable again now.

> So from the above we can conclude that the tests are running fine
> on hpcloud and not on rax provider. Since the OS (centos7) inside
> the VM across provider is same, this now boils down to some issue
> with rax provider VM + centos7 combination.

This certainly seems possible.

> Another data point I could gather is:
>     The only other centos7 job we have is
> check-tempest-dsvm-centos7 and it does not run full tempest
> looking at the job's config it only runs smoke tests (also
> confirmed the same with Ian W) which i believe is a subset of
> tests only.

Correct, so if we confirm that we can't successfully run tempest
full on CentOS 7 in both of our providers yet, we should probably
think hard about the implications on yesterday's discussion as to
whether to set the smoke version gating on devstack and
devstack-gate changes.

> So that brings to the conclusion that probably cinder-glusterfs CI
> job (check-tempest-dsvm-full-glusterfs-centos7) is the first
> centos7 based job running full tempest tests in upstream CI and
> hence is the first to hit the issue, but on rax provider only

Entirely likely. As I mentioned last week, we don't yet have any
voting/gating jobs running on the platform as far as I can tell, so
it's still very much in an experimental stage.
-- 
Jeremy Stanley

__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [devstack] [Cinder-GlusterFS CI] centos7 gate job abrupt failures

2015-02-25 Thread Deepak Shetty
On Wed, Feb 25, 2015 at 6:11 AM, Jeremy Stanley  wrote:

> On 2015-02-25 01:02:07 +0530 (+0530), Bharat Kumar wrote:
> [...]
> > After running 971 test cases VM inaccessible for 569 ticks
> [...]
>
> Glad you're able to reproduce it. For the record that is running
> their 8GB performance flavor with a CentOS 7 PVHVM base image. The
>

So we had 2 runs in total on the rax provider VM, and below are the results:

Run 1) It failed and re-created the OOM. The setup had glusterfs as a
storage backend for Cinder.

[deepakcs@deepakcs r6-jeremy-rax-vm]$ grep oom-killer
run1-w-gluster/logs/syslog.txt
Feb 24 18:41:08 devstack-centos7-rax-dfw-979654.slave.openstack.org kernel:
mysqld invoked oom-killer: gfp_mask=0x201da, order=0, oom_score_adj=0

Run 2) We *removed the glusterfs backend*, so Cinder was configured with the
default storage backend, i.e. LVM. *We re-created the OOM here too.*

So that proves that glusterfs doesn't cause it, as it's happening without
glusterfs too.
The VM (104.239.136.99) is now in such a bad shape that existing ssh sessions
are no longer responding for a long, long time now, though ping works. So we
need someone to help reboot/restart the VM so that we can collect the logs
for the record. Couldn't find anyone during APAC TZ to get it rebooted.

We managed to get the below grep to work after a long time from another
terminal, to prove that the OOM did happen for run 2:

bash-4.2$ sudo cat /var/log/messages| grep oom-killer
Feb 25 08:53:16 devstack-centos7-rax-dfw-979654 kernel: ntpd invoked
oom-killer: gfp_mask=0x201da, order=0, oom_score_adj=0
Feb 25 09:03:35 devstack-centos7-rax-dfw-979654 kernel: beam.smp invoked
oom-killer: gfp_mask=0x201da, order=0, oom_score_adj=0
Feb 25 09:57:28 devstack-centos7-rax-dfw-979654 kernel: mysqld invoked
oom-killer: gfp_mask=0x201da, order=0, oom_score_adj=0
Feb 25 10:40:38 devstack-centos7-rax-dfw-979654 kernel: mysqld invoked
oom-killer: gfp_mask=0x201da, order=0, oom_score_adj=0


steps to recreate are http://paste.openstack.org/show/181303/ as
> discussed in IRC (for the sake of others following along). I've held
> a similar worker in HPCloud (15.126.235.20) which is a 30GB flavor
>

We ran 2 runs in total on the hpcloud provider VM (and this time it was
set up correctly with 8G RAM, as evident from /proc/meminfo as well as the
dstat output).

Run 1) It was successful. The setup had glusterfs as a storage backend for
Cinder. Only 2 testcases failed, and they were expected. No OOM happened.

[deepakcs@deepakcs r7-jeremy-hpcloud-vm]$ grep oom-killer
run1-w-gluster/logs/syslog.txt
[deepakcs@deepakcs r7-jeremy-hpcloud-vm]$

Run 2) Since run 1 went fine, we enabled the tempest volume backup testcases
too and ran again. It was successful and no OOM happened.

[deepakcs@deepakcs r7-jeremy-hpcloud-vm]$ grep oom-killer
run2-w-gluster/logs/syslog.txt
[deepakcs@deepakcs r7-jeremy-hpcloud-vm]$


> artifically limited to 8GB through a kernel boot parameter.
> Hopefully following the same steps there will help either confirm
> the issue isn't specific to running in one particular service
> provider, or will yield some useful difference which could help
> highlight the cause.
>

So from the above we can conclude that the tests are running fine on hpcloud
and not on the rax provider. Since the OS (centos7) inside the VM is the same
across providers, this now boils down to some issue with the rax provider VM
+ centos7 combination.

Another data point I could gather is:
The only other centos7 job we have is check-tempest-dsvm-centos7, and it does
not run full tempest; looking at the job's config, it only runs smoke tests
(also confirmed the same with Ian W), which I believe is a subset of the
tests only.

So that brings us to the conclusion that the cinder-glusterfs CI job
(check-tempest-dsvm-full-glusterfs-centos7) is probably the first
centos7-based job running full tempest tests in upstream CI, and hence is the
first to hit the issue, but on the rax provider only.

thanx,
deepak
__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [devstack] [Cinder-GlusterFS CI] centos7 gate job abrupt failures

2015-02-24 Thread Jeremy Stanley
On 2015-02-25 01:02:07 +0530 (+0530), Bharat Kumar wrote:
[...]
> After running 971 test cases VM inaccessible for 569 ticks
[...]

Glad you're able to reproduce it. For the record that is running
their 8GB performance flavor with a CentOS 7 PVHVM base image. The
steps to recreate are http://paste.openstack.org/show/181303/ as
discussed in IRC (for the sake of others following along). I've held
a similar worker in HPCloud (15.126.235.20) which is a 30GB flavor
artificially limited to 8GB through a kernel boot parameter.
Hopefully following the same steps there will help either confirm
the issue isn't specific to running in one particular service
provider, or will yield some useful difference which could help
highlight the cause.
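
(For reference, the "artificially limited to 8GB" above is typically done
with the mem= kernel parameter; a hedged sketch for a grub2-based image, not
necessarily the exact mechanism used on this worker:)

  # cap usable RAM at 8G via the kernel command line, then reboot
  sudo sed -i 's/^GRUB_CMDLINE_LINUX="/&mem=8G /' /etc/default/grub
  sudo grub2-mkconfig -o /boot/grub2/grub.cfg && sudo reboot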

Either way, once 104.239.136.99 and 15.126.235.20 are no longer
needed, please let one of the infrastructure root admins know to
delete them.
-- 
Jeremy Stanley

__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [devstack] [Cinder-GlusterFS CI] centos7 gate job abrupt failures

2015-02-24 Thread Bharat Kumar

Ran the job manually on the rax VM provided by Jeremy. (Thank you, Jeremy.)

After running 971 test cases the VM was inaccessible for 569 ticks, then it
continued... (look at the console.log [1]).

And also have a look at the dstat log [2].

The summary is:
==
Totals
==
Ran: 1125 tests in 5835 sec.
 - Passed: 960
 - Skipped: 88
 - Expected Fail: 0
 - Unexpected Success: 0
 - Failed: 77
Sum of execute time for each test: 13603.6755 sec.


[1] https://etherpad.openstack.org/p/rax_console.txt
[2] https://etherpad.openstack.org/p/rax_dstat.log

On 02/24/2015 07:03 PM, Deepak Shetty wrote:
FWIW, we tried to run our job in a rax provider VM (provided by ianw 
from his personal account)
and we ran the tempest tests twice, but the OOM did not re-create. Of 
the 2 runs, one of the run
used the same PYTHONHASHSEED as we had in one of the failed runs, 
still no oom.


Jeremy graciously agreed to provide us 2 VMs , one each from rax and 
hpcloud provider

to see if provider platform has anything to do with it.

So we plan to run again wtih the VMs given from Jeremy , post which i 
will send

next update here.

thanx,
deepak


On Tue, Feb 24, 2015 at 4:50 AM, Jeremy Stanley wrote:


Due to an image setup bug (I have a fix proposed currently), I was
able to rerun this on a VM in HPCloud with 30GB memory and it
completed in about an hour with a couple of tempest tests failing.
Logs at: http://fungi.yuggoth.org/tmp/logs3.tar

Rerunning again on another 8GB Rackspace VM with the job timeout
increased to 5 hours, I was able to recreate the network
connectivity issues exhibited previously. The job itself seems to
have run for roughly 3 hours while failing 15 tests, and the worker
was mostly unreachable for a while at the end (I don't know exactly
how long) until around the time it completed. The OOM condition is
present this time too according to the logs, occurring right near
the end of the job. Collected logs are available at:
http://fungi.yuggoth.org/tmp/logs4.tar

Given the comparison between these two runs, I suspect this is
either caused by memory constraints or block device I/O performance
differences (or perhaps an unhappy combination of the two).
Hopefully a close review of the logs will indicate which.
--
Jeremy Stanley

__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe:
openstack-dev-requ...@lists.openstack.org?subject:unsubscribe

http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev




__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


--
Warm Regards,
Bharat Kumar Kobagana
Software Engineer
OpenStack Storage – RedHat India
Mobile - +91 9949278005

__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [devstack] [Cinder-GlusterFS CI] centos7 gate job abrupt failures

2015-02-24 Thread Daniel P. Berrange
On Fri, Feb 20, 2015 at 10:49:29AM -0800, Joe Gordon wrote:
> On Fri, Feb 20, 2015 at 7:29 AM, Deepak Shetty  wrote:
> 
> > Hi Jeremy,
> >   Couldn't find anything strong in the logs to back the reason for OOM.
> > At the time OOM happens, mysqld and java processes have the most RAM hence
> > OOM selects mysqld (4.7G) to be killed.
> >
> > From a glusterfs backend perspective, i haven't found anything suspicious,
> > and we don't have the logs of glusterfs (which is typically in
> > /var/log/glusterfs) so can't delve inside glusterfs too much :(
> >
> > BharatK (in CC) also tried to re-create the issue in local VM setup, but
> > it hasn't yet!
> >
> > Having said that,* we do know* that we started seeing this issue after we
> > enabled the nova-assisted-snapshot tests (by changing nova' s policy.json
> > to enable non-admin to create hyp-assisted snaps). We think that enabling
> > online snaps might have added to the number of tests and memory load &
> > thats the only clue we have as of now!
> >
> >
> It looks like OOM killer hit while qemu was busy and during
> a ServerRescueTest. Maybe libvirt logs would be useful as well?
> 
> And I don't see any tempest tests calling assisted-volume-snapshots
> 
> Also this looks odd: Feb 19 18:47:16
> devstack-centos7-rax-iad-916633.slave.openstack.org libvirtd[3753]: missing
> __com.redhat_reason in disk io error event

So that specific error message is harmless - the __com.redhat_reason field
is nothing important from OpenStack's POV.

However, it is interesting that QEMU is seeing an I/O error in the first
place. This occurs when you have a grow-on-demand file and the underlying
storage is full, so it is unable to allocate more blocks to cope with a guest
write. It can also occur if the underlying storage has a fatal I/O problem,
e.g. a dead sector in a hard disk, or some equivalent.

IOW, I'd not expect to see any I/O errors raised from OpenStack in a normal
scenario. So this is something to consider investigating.
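
(A quick way to check for the "grow on demand file on nearly-full storage"
case described above; the paths are examples from a typical devstack layout,
not taken from the failing run:)

  df -h /opt/stack/data      # is the filesystem backing the instance disks close to full?
  sudo qemu-img info /opt/stack/data/nova/instances/<instance-uuid>/disk   # virtual size vs. allocated size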

Regards,
Daniel
-- 
|: http://berrange.com  -o-http://www.flickr.com/photos/dberrange/ :|
|: http://libvirt.org  -o- http://virt-manager.org :|
|: http://autobuild.org   -o- http://search.cpan.org/~danberr/ :|
|: http://entangle-photo.org   -o-   http://live.gnome.org/gtk-vnc :|

__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [devstack] [Cinder-GlusterFS CI] centos7 gate job abrupt failures

2015-02-24 Thread Deepak Shetty
FWIW, we tried to run our job in a rax provider VM (provided by ianw from
his personal account) and we ran the tempest tests twice, but the OOM did not
re-occur. Of the 2 runs, one used the same PYTHONHASHSEED as we had in one of
the failed runs; still no OOM.
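
(For the record, pinning the seed for the repeat run is just an environment
variable set before invoking the steps from the paste; the value below is a
placeholder for the one recorded in the failed run's logs:)

  export PYTHONHASHSEED=1234   # placeholder; we reused the seed from the failed run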

Jeremy graciously agreed to provide us 2 VMs, one each from the rax and
hpcloud providers, to see if the provider platform has anything to do with
it.

So we plan to run again with the VMs given by Jeremy, after which I will send
the next update here.

thanx,
deepak


On Tue, Feb 24, 2015 at 4:50 AM, Jeremy Stanley  wrote:

> Due to an image setup bug (I have a fix proposed currently), I was
> able to rerun this on a VM in HPCloud with 30GB memory and it
> completed in about an hour with a couple of tempest tests failing.
> Logs at: http://fungi.yuggoth.org/tmp/logs3.tar
>
> Rerunning again on another 8GB Rackspace VM with the job timeout
> increased to 5 hours, I was able to recreate the network
> connectivity issues exhibited previously. The job itself seems to
> have run for roughly 3 hours while failing 15 tests, and the worker
> was mostly unreachable for a while at the end (I don't know exactly
> how long) until around the time it completed. The OOM condition is
> present this time too according to the logs, occurring right near
> the end of the job. Collected logs are available at:
> http://fungi.yuggoth.org/tmp/logs4.tar
>
> Given the comparison between these two runs, I suspect this is
> either caused by memory constraints or block device I/O performance
> differences (or perhaps an unhappy combination of the two).
> Hopefully a close review of the logs will indicate which.
> --
> Jeremy Stanley
>
> __
> OpenStack Development Mailing List (not for usage questions)
> Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
> http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
>
__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [devstack] [Cinder-GlusterFS CI] centos7 gate job abrupt failures

2015-02-20 Thread Deepak Shetty
On Feb 21, 2015 12:26 AM, "Joe Gordon"  wrote:
>
>
>
> On Fri, Feb 20, 2015 at 7:29 AM, Deepak Shetty 
wrote:
>>
>> Hi Jeremy,
>>   Couldn't find anything strong in the logs to back the reason for OOM.
>> At the time OOM happens, mysqld and java processes have the most RAM
hence OOM selects mysqld (4.7G) to be killed.
>>
>> From a glusterfs backend perspective, i haven't found anything
suspicious, and we don't have the logs of glusterfs (which is typically in
/var/log/glusterfs) so can't delve inside glusterfs too much :(
>>
>> BharatK (in CC) also tried to re-create the issue in local VM setup, but
it hasn't yet!
>>
>> Having said that, we do know that we started seeing this issue after we
enabled the nova-assisted-snapshot tests (by changing nova' s policy.json
to enable non-admin to create hyp-assisted snaps). We think that enabling
online snaps might have added to the number of tests and memory load &
thats the only clue we have as of now!
>>
>
> It looks like OOM killer hit while qemu was busy and during
a ServerRescueTest. Maybe libvirt logs would be useful as well?

Thanks for the data point, will look at this test to understand more about
what's happening.

>
> And I don't see any tempest tests calling assisted-volume-snapshots

Maybe it just hasn't reached it yet.

Thanks
Deepak

>
> Also this looks odd: Feb 19 18:47:16
devstack-centos7-rax-iad-916633.slave.openstack.org libvirtd[3753]: missing
__com.redhat_reason in disk io error event
>
>
>>
>> So :
>>
>>   1) BharatK  has merged the patch (
https://review.openstack.org/#/c/157707/ ) to revert the policy.json in the
glusterfs job. So no more nova-assisted-snap tests.
>>
>>   2) We also are increasing the timeout of our job in patch (
https://review.openstack.org/#/c/157835/1 ) so that we can get a full run
without timeouts to do a good analysis of the logs (logs are not posted if
the job times out)
>>
>> Can you please re-enable our job, so that we can confirm that disabling
online snap TCs is helping the issue, which if it does, can help us narrow
down the issue.
>>
>> We also plan to monitor & debug over the weekend hence having the job
enabled can help us a lot.
>>
>> thanx,
>> deepak
>>
>>
>> On Thu, Feb 19, 2015 at 10:37 PM, Jeremy Stanley 
wrote:
>>>
>>> On 2015-02-19 17:03:49 +0100 (+0100), Deepak Shetty wrote:
>>> [...]
>>> > For some reason we are seeing the centos7 glusterfs CI job getting
>>> > aborted/ killed either by Java exception or the build getting
>>> > aborted due to timeout.
>>> [...]
>>> > Hoping to root cause this soon and get the cinder-glusterfs CI job
>>> > back online soon.
>>>
>>> I manually reran the same commands this job runs on an identical
>>> virtual machine and was able to reproduce some substantial
>>> weirdness.
>>>
>>> I temporarily lost remote access to the VM around 108 minutes into
>>> running the job (~17:50 in the logs) and the out of band console
>>> also became unresponsive to carriage returns. The machine's IP
>>> address still responded to ICMP ping, but attempts to open new TCP
>>> sockets to the SSH service never got a protocol version banner back.
>>> After about 10 minutes of that I went out to lunch but left
>>> everything untouched. To my excitement it was up and responding
>>> again when I returned.
>>>
>>> It appears from the logs that it runs well past the 120-minute mark
>>> where devstack-gate tries to kill the gate hook for its configured
>>> timeout. Somewhere around 165 minutes in (18:47) you can see the
>>> kernel out-of-memory killer starts to kick in and kill httpd and
>>> mysqld processes according to the syslog. Hopefully this is enough
>>> additional detail to get you a start at finding the root cause so
>>> that we can reenable your job. Let me know if there's anything else
>>> you need for this.
>>>
>>> [1] http://fungi.yuggoth.org/tmp/logs.tar
>>> --
>>> Jeremy Stanley
>>>
>>>
__
>>> OpenStack Development Mailing List (not for usage questions)
>>> Unsubscribe:
openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
>>> http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
>>
>>
>>
>>
__
>> OpenStack Development Mailing List (not for usage questions)
>> Unsubscribe:
openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
>> http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
>>
>
>
> __
> OpenStack Development Mailing List (not for usage questions)
> Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
> http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
>
__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev

Re: [openstack-dev] [devstack] [Cinder-GlusterFS CI] centos7 gate job abrupt failures

2015-02-20 Thread Deepak Shetty
On Feb 21, 2015 12:20 AM, "Jeremy Stanley"  wrote:
>
> On 2015-02-20 16:29:31 +0100 (+0100), Deepak Shetty wrote:
> > Couldn't find anything strong in the logs to back the reason for
> > OOM. At the time OOM happens, mysqld and java processes have the
> > most RAM hence OOM selects mysqld (4.7G) to be killed.
> [...]
>
> Today I reran it after you rolled back some additional tests, and it
> runs for about 117 minutes before the OOM killer shoots nova-compute
> in the head. At your request I've added /var/log/glusterfs into the
> tarball this time: http://fungi.yuggoth.org/tmp/logs2.tar

Thanks Jeremy, can we get SSH access to one of these environments to debug?

Thanks
Deepak

> --
> Jeremy Stanley
>
> __
> OpenStack Development Mailing List (not for usage questions)
> Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
> http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [devstack] [Cinder-GlusterFS CI] centos7 gate job abrupt failures

2015-02-20 Thread Joe Gordon
On Fri, Feb 20, 2015 at 7:29 AM, Deepak Shetty  wrote:

> Hi Jeremy,
>   Couldn't find anything strong in the logs to back the reason for OOM.
> At the time OOM happens, mysqld and java processes have the most RAM hence
> OOM selects mysqld (4.7G) to be killed.
>
> From a glusterfs backend perspective, i haven't found anything suspicious,
> and we don't have the logs of glusterfs (which is typically in
> /var/log/glusterfs) so can't delve inside glusterfs too much :(
>
> BharatK (in CC) also tried to re-create the issue in local VM setup, but
> it hasn't yet!
>
> Having said that,* we do know* that we started seeing this issue after we
> enabled the nova-assisted-snapshot tests (by changing nova' s policy.json
> to enable non-admin to create hyp-assisted snaps). We think that enabling
> online snaps might have added to the number of tests and memory load &
> thats the only clue we have as of now!
>
>
It looks like the OOM killer hit while qemu was busy, during a
ServerRescueTest. Maybe the libvirt logs would be useful as well?

And I don't see any tempest tests calling assisted-volume-snapshots
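
(A quick way to double-check that from a tempest checkout, just a grep over
the source tree, nothing job-specific:)

  grep -rniE "assisted[-_]volume[-_]snapshots" tempest/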

Also this looks odd: Feb 19 18:47:16
devstack-centos7-rax-iad-916633.slave.openstack.org libvirtd[3753]: missing
__com.redhat_reason in disk io error event



> So :
>
>   1) BharatK  has merged the patch (
> https://review.openstack.org/#/c/157707/ ) to revert the policy.json in
> the glusterfs job. So no more nova-assisted-snap tests.
>
>   2) We also are increasing the timeout of our job in patch (
> https://review.openstack.org/#/c/157835/1 ) so that we can get a full run
> without timeouts to do a good analysis of the logs (logs are not posted if
> the job times out)
>
> Can you please re-enable our job, so that we can confirm that disabling
> online snap TCs is helping the issue, which if it does, can help us narrow
> down the issue.
>
> We also plan to monitor & debug over the weekend hence having the job
> enabled can help us a lot.
>
> thanx,
> deepak
>
>
> On Thu, Feb 19, 2015 at 10:37 PM, Jeremy Stanley 
> wrote:
>
>> On 2015-02-19 17:03:49 +0100 (+0100), Deepak Shetty wrote:
>> [...]
>> > For some reason we are seeing the centos7 glusterfs CI job getting
>> > aborted/ killed either by Java exception or the build getting
>> > aborted due to timeout.
>> [...]
>> > Hoping to root cause this soon and get the cinder-glusterfs CI job
>> > back online soon.
>>
>> I manually reran the same commands this job runs on an identical
>> virtual machine and was able to reproduce some substantial
>> weirdness.
>>
>> I temporarily lost remote access to the VM around 108 minutes into
>> running the job (~17:50 in the logs) and the out of band console
>> also became unresponsive to carriage returns. The machine's IP
>> address still responded to ICMP ping, but attempts to open new TCP
>> sockets to the SSH service never got a protocol version banner back.
>> After about 10 minutes of that I went out to lunch but left
>> everything untouched. To my excitement it was up and responding
>> again when I returned.
>>
>> It appears from the logs that it runs well past the 120-minute mark
>> where devstack-gate tries to kill the gate hook for its configured
>> timeout. Somewhere around 165 minutes in (18:47) you can see the
>> kernel out-of-memory killer starts to kick in and kill httpd and
>> mysqld processes according to the syslog. Hopefully this is enough
>> additional detail to get you a start at finding the root cause so
>> that we can reenable your job. Let me know if there's anything else
>> you need for this.
>>
>> [1] http://fungi.yuggoth.org/tmp/logs.tar
>> --
>> Jeremy Stanley
>>
>> __
>> OpenStack Development Mailing List (not for usage questions)
>> Unsubscribe:
>> openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
>> http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
>>
>
>
> __
> OpenStack Development Mailing List (not for usage questions)
> Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
> http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
>
>
__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [devstack] [Cinder-GlusterFS CI] centos7 gate job abrupt failures

2015-02-20 Thread Jeremy Stanley
On 2015-02-20 16:29:31 +0100 (+0100), Deepak Shetty wrote:
> Couldn't find anything strong in the logs to back the reason for
> OOM. At the time OOM happens, mysqld and java processes have the
> most RAM hence OOM selects mysqld (4.7G) to be killed.
[...]

Today I reran it after you rolled back some additional tests, and it
runs for about 117 minutes before the OOM killer shoots nova-compute
in the head. At your request I've added /var/log/glusterfs into the
tarball this time: http://fungi.yuggoth.org/tmp/logs2.tar
-- 
Jeremy Stanley

__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [devstack] [Cinder-GlusterFS CI] centos7 gate job abrupt failures

2015-02-20 Thread Deepak Shetty
Hi Jeremy,
  Couldn't find anything strong in the logs to back the reason for OOM.
At the time the OOM happens, the mysqld and java processes have the most
RAM, hence the OOM killer selects mysqld (4.7G) to be killed.

From a glusterfs backend perspective, I haven't found anything suspicious,
and we don't have the glusterfs logs (which are typically in
/var/log/glusterfs), so we can't delve into glusterfs too much :(

BharatK (in CC) also tried to re-create the issue in a local VM setup, but
it hasn't reproduced yet!

Having said that, *we do know* that we started seeing this issue after we
enabled the nova-assisted-snapshot tests (by changing nova's policy.json to
allow non-admin users to create hypervisor-assisted snaps). We think that
enabling online snaps might have added to the number of tests and the memory
load, and that's the only clue we have as of now!
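
(For context, the policy.json relaxation was roughly of this shape; the key
name is the one from the Juno/Kilo-era nova policy file, and this is an
illustration, not the exact patch:)

  # relax the assisted-volume-snapshots rule from admin-only to everyone
  sudo sed -i 's/"compute_extension:os-assisted-volume-snapshots:create": "rule:admin_api"/"compute_extension:os-assisted-volume-snapshots:create": ""/' \
    /etc/nova/policy.json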

So :

  1) BharatK has merged the patch
(https://review.openstack.org/#/c/157707/) to revert the policy.json change
in the glusterfs job, so no more nova-assisted-snap tests.

  2) We are also increasing the timeout of our job in patch
(https://review.openstack.org/#/c/157835/1) so that we can get a full run
without timeouts and do a good analysis of the logs (logs are not posted if
the job times out); see the sketch below.
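
(The timeout knob being raised there is the devstack-gate one exported in the
job definition; an illustrative value, not the number from the actual patch:)

  export DEVSTACK_GATE_TIMEOUT=180   # minutes; the job previously hit the 120-minute limit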

Can you please re-enable our job, so that we can confirm whether disabling
the online snap testcases helps; if it does, that will help us narrow down
the issue.

We also plan to monitor & debug over the weekend, hence having the job
enabled would help us a lot.

thanx,
deepak


On Thu, Feb 19, 2015 at 10:37 PM, Jeremy Stanley  wrote:

> On 2015-02-19 17:03:49 +0100 (+0100), Deepak Shetty wrote:
> [...]
> > For some reason we are seeing the centos7 glusterfs CI job getting
> > aborted/ killed either by Java exception or the build getting
> > aborted due to timeout.
> [...]
> > Hoping to root cause this soon and get the cinder-glusterfs CI job
> > back online soon.
>
> I manually reran the same commands this job runs on an identical
> virtual machine and was able to reproduce some substantial
> weirdness.
>
> I temporarily lost remote access to the VM around 108 minutes into
> running the job (~17:50 in the logs) and the out of band console
> also became unresponsive to carriage returns. The machine's IP
> address still responded to ICMP ping, but attempts to open new TCP
> sockets to the SSH service never got a protocol version banner back.
> After about 10 minutes of that I went out to lunch but left
> everything untouched. To my excitement it was up and responding
> again when I returned.
>
> It appears from the logs that it runs well past the 120-minute mark
> where devstack-gate tries to kill the gate hook for its configured
> timeout. Somewhere around 165 minutes in (18:47) you can see the
> kernel out-of-memory killer starts to kick in and kill httpd and
> mysqld processes according to the syslog. Hopefully this is enough
> additional detail to get you a start at finding the root cause so
> that we can reenable your job. Let me know if there's anything else
> you need for this.
>
> [1] http://fungi.yuggoth.org/tmp/logs.tar
> --
> Jeremy Stanley
>
> __
> OpenStack Development Mailing List (not for usage questions)
> Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
> http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
>
__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [devstack] [Cinder-GlusterFS CI] centos7 gate job abrupt failures

2015-02-19 Thread Jeremy Stanley
On 2015-02-19 17:03:49 +0100 (+0100), Deepak Shetty wrote:
[...]
> For some reason we are seeing the centos7 glusterfs CI job getting
> aborted/ killed either by Java exception or the build getting
> aborted due to timeout.
[...]
> Hoping to root cause this soon and get the cinder-glusterfs CI job
> back online soon.

I manually reran the same commands this job runs on an identical
virtual machine and was able to reproduce some substantial
weirdness.

I temporarily lost remote access to the VM around 108 minutes into
running the job (~17:50 in the logs) and the out of band console
also became unresponsive to carriage returns. The machine's IP
address still responded to ICMP ping, but attempts to open new TCP
sockets to the SSH service never got a protocol version banner back.
After about 10 minutes of that I went out to lunch but left
everything untouched. To my excitement it was up and responding
again when I returned.

It appears from the logs that it runs well past the 120-minute mark
where devstack-gate tries to kill the gate hook for its configured
timeout. Somewhere around 165 minutes in (18:47) you can see the
kernel out-of-memory killer starts to kick in and kill httpd and
mysqld processes according to the syslog. Hopefully this is enough
additional detail to get you a start at finding the root cause so
that we can reenable your job. Let me know if there's anything else
you need for this.

[1] http://fungi.yuggoth.org/tmp/logs.tar
-- 
Jeremy Stanley

__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev