Re: [openstack-dev] [devstack] [Cinder-GlusterFS CI] centos7 gate job abrupt failures

2015-03-05 Thread Deepak Shetty
Update:

   The Cinder-GlusterFS CI job (Ubuntu based) was added as experimental (non
voting) to the cinder project [1].
It's running successfully without any issues so far [2], [3].

We will monitor it for a few days and, if it continues to run fine, we will
propose a patch to make it a check (voting) job.

[1]: https://review.openstack.org/160664
[2]: https://jenkins07.openstack.org/job/gate-tempest-dsvm-full-glusterfs/
[3]: https://jenkins02.openstack.org/job/gate-tempest-dsvm-full-glusterfs/

thanx,
deepak

On Fri, Feb 27, 2015 at 10:47 PM, Deepak Shetty dpkshe...@gmail.com wrote:

 [...]

__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [devstack] [Cinder-GlusterFS CI] centos7 gate job abrupt failures

2015-02-27 Thread Deepak Shetty
On Fri, Feb 27, 2015 at 4:02 PM, Deepak Shetty dpkshe...@gmail.com wrote:

 [...]

Clarkb, Fungi,
  Given that the Ubuntu job is stable, I would like to propose adding it as
experimental to openstack cinder while we work on fixing the 2 failed test
cases in parallel.

thanx,
deepak


Re: [openstack-dev] [devstack] [Cinder-GlusterFS CI] centos7 gate job abrupt failures

2015-02-27 Thread Deepak Shetty
On Wed, Feb 25, 2015 at 11:48 PM, Deepak Shetty dpkshe...@gmail.com wrote:



 On Wed, Feb 25, 2015 at 8:42 PM, Deepak Shetty dpkshe...@gmail.com
 wrote:



 On Wed, Feb 25, 2015 at 6:34 PM, Jeremy Stanley fu...@yuggoth.org
 wrote:

 On 2015-02-25 17:02:34 +0530 (+0530), Deepak Shetty wrote:
 [...]
  Run 2) We removed glusterfs backend, so Cinder was configured with
  the default storage backend i.e. LVM. We re-created the OOM here
  too
 
  So that proves that glusterfs doesn't cause it, as its happening
  without glusterfs too.

 Well, if you re-ran the job on the same VM then the second result is
 potentially contaminated. Luckily this hypothesis can be confirmed
 by running the second test on a fresh VM in Rackspace.


 Maybe true, but we did the same on hpcloud provider VM too and both time
 it ran successfully with glusterfs as the cinder backend. Also before
 starting
 the 2nd run, we did unstack and saw that free memory did go back to 5G+
 and then re-invoked your script, I believe the contamination could result
 in some
 additional testcase failures (which we did see) but shouldn't be related
 to
 whether system can OOM or not, since thats a runtime thing.

 I see that the VM is up again. We will execute the 2nd run afresh now and
 update
 here.


 Ran tempest configured with the default backend, i.e. LVM, and was able to
 recreate the OOM issue. So running tempest without gluster against a fresh
 VM reliably recreates the OOM issue; snippet from syslog below.

 Feb 25 16:58:37 devstack-centos7-rax-dfw-979654 kernel: glance-api invoked
 oom-killer: gfp_mask=0x201da, order=0, oom_score_adj=0

 Had a discussion with clarkb on IRC and given that F20 is discontinued,
 F21 has issues with tempest (under debug by ianw)
 and centos7 also has issues on rax (as evident from this thread), the only
 option left is to go with ubuntu based CI job, which
 BharatK is working on now.


Quick Update:

Cinder-GlusterFS CI job on ubuntu was added (
https://review.openstack.org/159217)

We ran it 3 times against our stackforge repo patch @
https://review.openstack.org/159711
and it works fine (2 testcase failures, which are expected and we're
working towards fixing them)

For the logs of the 3 experimental runs, look @
http://logs.openstack.org/11/159711/1/experimental/gate-tempest-dsvm-full-glusterfs/

Of the 3 jobs, 1 was scheduled on rax and 2 on hpcloud, so it's working
nicely across the different cloud providers.

thanx,
deepak


Re: [openstack-dev] [devstack] [Cinder-GlusterFS CI] centos7 gate job abrupt failures

2015-02-26 Thread Clark Boylan


On Thu, Feb 26, 2015, at 03:03 AM, Deepak Shetty wrote:
 On Wed, Feb 25, 2015 at 6:11 AM, Jeremy Stanley fu...@yuggoth.org
 wrote:
 
  On 2015-02-25 01:02:07 +0530 (+0530), Bharat Kumar wrote:
  [...]
   After running 971 test cases VM inaccessible for 569 ticks
  [...]
 
  Glad you're able to reproduce it. For the record that is running
  their 8GB performance flavor with a CentOS 7 PVHVM base image. The
  steps to recreate are http://paste.openstack.org/show/181303/ as
  discussed in IRC (for the sake of others following along). I've held
  a similar worker in HPCloud (15.126.235.20) which is a 30GB flavor
  artificially limited to 8GB through a kernel boot parameter.
  Hopefully following the same steps there will help either confirm
  the issue isn't specific to running in one particular service
  provider, or will yield some useful difference which could help
  highlight the cause.
 
  Either way, once 104.239.136.99 and 15.126.235.20 are no longer
  needed, please let one of the infrastructure root admins know to
  delete them.
 
 
 You can delete these VMs, will request if needed again

I have marked these VMs for deletion and they should be gone shortly. The new
experimental job is in place so you can start testing that against your
plugin with `check experimental` comments.

Clark



Re: [openstack-dev] [devstack] [Cinder-GlusterFS CI] centos7 gate job abrupt failures

2015-02-26 Thread Deepak Shetty
On Wed, Feb 25, 2015 at 6:11 AM, Jeremy Stanley fu...@yuggoth.org wrote:

 On 2015-02-25 01:02:07 +0530 (+0530), Bharat Kumar wrote:
 [...]
  After running 971 test cases VM inaccessible for 569 ticks
 [...]

 Glad you're able to reproduce it. For the record that is running
 their 8GB performance flavor with a CentOS 7 PVHVM base image. The
 steps to recreate are http://paste.openstack.org/show/181303/ as
 discussed in IRC (for the sake of others following along). I've held
 a similar worker in HPCloud (15.126.235.20) which is a 30GB flavor
 artificially limited to 8GB through a kernel boot parameter.
 Hopefully following the same steps there will help either confirm
 the issue isn't specific to running in one particular service
 provider, or will yield some useful difference which could help
 highlight the cause.

 Either way, once 104.239.136.99 and 15.126.235.20 are no longer
 needed, please let one of the infrastructure root admins know to
 delete them.


You can delete these VMs; we will request them again if needed.

thanx,
deepak


Re: [openstack-dev] [devstack] [Cinder-GlusterFS CI] centos7 gate job abrupt failures

2015-02-25 Thread Jeremy Stanley
On 2015-02-25 17:02:34 +0530 (+0530), Deepak Shetty wrote:
[...]
 Run 2) We removed glusterfs backend, so Cinder was configured with
 the default storage backend i.e. LVM. We re-created the OOM here
 too
 
 So that proves that glusterfs doesn't cause it, as its happening
 without glusterfs too.

Well, if you re-ran the job on the same VM then the second result is
potentially contaminated. Luckily this hypothesis can be confirmed
by running the second test on a fresh VM in Rackspace.

 The VM (104.239.136.99) is now in such a bad shape that existing
 ssh sessions are no longer responding for a long long time now,
 tho' ping works. So need someone to help reboot/restart the VM so
 that we can collect the logs for records. Couldn't find anyone
 during apac TZ to get it reboot.
[...]

According to novaclient that instance was in a shutoff state, and
so I had to nova reboot --hard to get it running. Looks like it's
back up and reachable again now.

 So from the above we can conclude that the tests are running fine
 on hpcloud and not on rax provider. Since the OS (centos7) inside
 the VM across provider is same, this now boils down to some issue
 with rax provider VM + centos7 combination.

This certainly seems possible.

 Another data point I could gather is:
     The only other centos7 job we have is
 check-tempest-dsvm-centos7 and it does not run full tempest
 looking at the job's config it only runs smoke tests (also
 confirmed the same with Ian W) which i believe is a subset of
 tests only.

Correct, so if we confirm that we can't successfully run tempest
full on CentOS 7 in both of our providers yet, we should probably
think hard about the implications on yesterday's discussion as to
whether to set the smoke version gating on devstack and
devstack-gate changes.

 So that brings to the conclusion that probably cinder-glusterfs CI
 job (check-tempest-dsvm-full-glusterfs-centos7) is the first
 centos7 based job running full tempest tests in upstream CI and
 hence is the first to hit the issue, but on rax provider only

Entirely likely. As I mentioned last week, we don't yet have any
voting/gating jobs running on the platform as far as I can tell, so
it's still very much in an experimental stage.
-- 
Jeremy Stanley



Re: [openstack-dev] [devstack] [Cinder-GlusterFS CI] centos7 gate job abrupt failures

2015-02-25 Thread Deepak Shetty
On Wed, Feb 25, 2015 at 6:34 PM, Jeremy Stanley fu...@yuggoth.org wrote:

 On 2015-02-25 17:02:34 +0530 (+0530), Deepak Shetty wrote:
 [...]
  Run 2) We removed glusterfs backend, so Cinder was configured with
  the default storage backend i.e. LVM. We re-created the OOM here
  too
 
  So that proves that glusterfs doesn't cause it, as its happening
  without glusterfs too.

 Well, if you re-ran the job on the same VM then the second result is
 potentially contaminated. Luckily this hypothesis can be confirmed
 by running the second test on a fresh VM in Rackspace.


Maybe true, but we did the same on the hpcloud provider VM too and both times
it ran successfully with glusterfs as the cinder backend. Also, before
starting the 2nd run, we did unstack and saw that free memory did go back to
5G+, and then re-invoked your script. I believe the contamination could
result in some additional testcase failures (which we did see) but shouldn't
be related to whether the system can OOM or not, since that's a runtime thing.

I see that the VM is up again. We will execute the 2nd run afresh now and
update here.



  The VM (104.239.136.99) is now in such a bad shape that existing
  ssh sessions are no longer responding for a long long time now,
  tho' ping works. So need someone to help reboot/restart the VM so
  that we can collect the logs for records. Couldn't find anyone
  during apac TZ to get it reboot.
 [...]

 According to novaclient that instance was in a shutoff state, and
 so I had to nova reboot --hard to get it running. Looks like it's
 back up and reachable again now.


Cool, thanks!



  So from the above we can conclude that the tests are running fine
  on hpcloud and not on rax provider. Since the OS (centos7) inside
  the VM across provider is same, this now boils down to some issue
  with rax provider VM + centos7 combination.

 This certainly seems possible.

  Another data point I could gather is:
  The only other centos7 job we have is
  check-tempest-dsvm-centos7 and it does not run full tempest
  looking at the job's config it only runs smoke tests (also
  confirmed the same with Ian W) which i believe is a subset of
  tests only.

 Correct, so if we confirm that we can't successfully run tempest
 full on CentOS 7 in both of our providers yet, we should probably
 think hard about the implications on yesterday's discussion as to
 whether to set the smoke version gating on devstack and
 devstack-gate changes.

  So that brings to the conclusion that probably cinder-glusterfs CI
  job (check-tempest-dsvm-full-glusterfs-centos7) is the first
  centos7 based job running full tempest tests in upstream CI and
  hence is the first to hit the issue, but on rax provider only

 Entirely likely. As I mentioned last week, we don't yet have any
 voting/gating jobs running on the platform as far as I can tell, so
 it's still very much in an experimental stage.


So is there a way for a job to ask for hpcloud affinity, since that's where
our job ran well (faster and only 2 failures, which were expected)? I am not
sure how easy and time consuming it would be to root cause why centos7 + rax
provider is causing OOM.

Alternatively, do you recommend using some other OS as the base for our job:
F20, F21 or Ubuntu? I assume that there are other jobs in the rax provider
that run on Fedora or Ubuntu with full tempest and don't OOM, would you know?

thanx,
deepak


Re: [openstack-dev] [devstack] [Cinder-GlusterFS CI] centos7 gate job abrupt failures

2015-02-25 Thread Deepak Shetty
On Wed, Feb 25, 2015 at 6:11 AM, Jeremy Stanley fu...@yuggoth.org wrote:

 On 2015-02-25 01:02:07 +0530 (+0530), Bharat Kumar wrote:
 [...]
  After running 971 test cases VM inaccessible for 569 ticks
 [...]

 Glad you're able to reproduce it. For the record that is running
 their 8GB performance flavor with a CentOS 7 PVHVM base image. The


So we had 2 runs in total in the rax provider VM and below are the results:

Run 1) It failed and re-created the OOM. The setup had glusterfs as a
storage backend for Cinder.

[deepakcs@deepakcs r6-jeremy-rax-vm]$ grep oom-killer run1-w-gluster/logs/syslog.txt
Feb 24 18:41:08 devstack-centos7-rax-dfw-979654.slave.openstack.org kernel: mysqld invoked oom-killer: gfp_mask=0x201da, order=0, oom_score_adj=0

Run 2) We *removed glusterfs backend*, so Cinder was configured with the
default storage backend i.e. LVM. *We re-created the OOM here too*

So that proves that glusterfs doesn't cause it, as it's happening without
glusterfs too.
The VM (104.239.136.99) is now in such bad shape that existing ssh sessions
have not been responding for a long time now, tho' ping works. So we need
someone to help reboot/restart the VM so that we can collect the logs for
the record. Couldn't find anyone during APAC TZ to get it rebooted.

We managed to get the below grep to work after a long time from another
terminal to prove that OOM did happen for run 2:

bash-4.2$ sudo cat /var/log/messages| grep oom-killer
Feb 25 08:53:16 devstack-centos7-rax-dfw-979654 kernel: ntpd invoked oom-killer: gfp_mask=0x201da, order=0, oom_score_adj=0
Feb 25 09:03:35 devstack-centos7-rax-dfw-979654 kernel: beam.smp invoked oom-killer: gfp_mask=0x201da, order=0, oom_score_adj=0
Feb 25 09:57:28 devstack-centos7-rax-dfw-979654 kernel: mysqld invoked oom-killer: gfp_mask=0x201da, order=0, oom_score_adj=0
Feb 25 10:40:38 devstack-centos7-rax-dfw-979654 kernel: mysqld invoked oom-killer: gfp_mask=0x201da, order=0, oom_score_adj=0
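
For anyone repeating the grep on a larger log, the same check could be scripted. The following is only a minimal Python sketch, not something that was run as part of this investigation: the function name and the regular expression are assumptions, and the sample input is the four run 2 lines quoted above. Note that the process named in an "invoked oom-killer" line is the one whose allocation triggered the OOM killer, which is not necessarily the process that was killed.

```python
import re
from collections import Counter

# Matches kernel lines such as:
#   "... kernel: mysqld invoked oom-killer: gfp_mask=0x201da, ..."
OOM_RE = re.compile(r"kernel: (?P<proc>\S+) invoked oom-killer")


def oom_invokers(lines):
    """Count, per process name, how often it triggered the OOM killer."""
    counts = Counter()
    for line in lines:
        match = OOM_RE.search(line)
        if match:
            counts[match.group("proc")] += 1
    return counts


# Sample input: the run 2 syslog lines from the rax VM.
SAMPLE = [
    "Feb 25 08:53:16 devstack-centos7-rax-dfw-979654 kernel: ntpd invoked "
    "oom-killer: gfp_mask=0x201da, order=0, oom_score_adj=0",
    "Feb 25 09:03:35 devstack-centos7-rax-dfw-979654 kernel: beam.smp invoked "
    "oom-killer: gfp_mask=0x201da, order=0, oom_score_adj=0",
    "Feb 25 09:57:28 devstack-centos7-rax-dfw-979654 kernel: mysqld invoked "
    "oom-killer: gfp_mask=0x201da, order=0, oom_score_adj=0",
    "Feb 25 10:40:38 devstack-centos7-rax-dfw-979654 kernel: mysqld invoked "
    "oom-killer: gfp_mask=0x201da, order=0, oom_score_adj=0",
]

if __name__ == "__main__":
    for proc, count in oom_invokers(SAMPLE).most_common():
        print(proc, count)
```

To use it on a real log, pass an open file object instead of SAMPLE; an empty result corresponds to the empty grep output seen on the hpcloud runs.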


steps to recreate are http://paste.openstack.org/show/181303/ as
 discussed in IRC (for the sake of others following along). I've held
 a similar worker in HPCloud (15.126.235.20) which is a 30GB flavor


We did 2 runs in total in the hpcloud provider VM (and this time it was
set up correctly with 8G RAM, as evident from /proc/meminfo as well as dstat
output)
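
As an aside, the /proc/meminfo check mentioned here can be done programmatically. This is a hedged Python sketch only: the helper name and the sample MemTotal value are hypothetical and are not taken from the actual VM. The kernel reserves some memory for itself, so a VM limited to 8G will typically report somewhat less than 8 GiB.

```python
def mem_total_gib(meminfo_text):
    """Return MemTotal from /proc/meminfo-style text, in GiB."""
    for line in meminfo_text.splitlines():
        if line.startswith("MemTotal:"):
            # Line format: "MemTotal:        8010904 kB" (the unit is KiB).
            kib = int(line.split()[1])
            return kib / (1024 * 1024)
    raise ValueError("MemTotal not found")


# Hypothetical sample; on a real worker, read /proc/meminfo instead.
SAMPLE_MEMINFO = "MemTotal:        8010904 kB\nMemFree:         5400000 kB\n"

if __name__ == "__main__":
    print(f"{mem_total_gib(SAMPLE_MEMINFO):.2f} GiB")
```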

Run 1) It was successful. The setup had glusterfs as a storage
backend for Cinder. Only 2 testcases failed, and they were expected. No OOM
happened.

[deepakcs@deepakcs r7-jeremy-hpcloud-vm]$ grep oom-killer run1-w-gluster/logs/syslog.txt
[deepakcs@deepakcs r7-jeremy-hpcloud-vm]$

Run 2) Since run 1 went fine, we enabled the tempest volume backup testcases
too and ran again.
It was successful and no OOM happened.

[deepakcs@deepakcs r7-jeremy-hpcloud-vm]$ grep oom-killer run2-w-gluster/logs/syslog.txt
[deepakcs@deepakcs r7-jeremy-hpcloud-vm]$


 artificially limited to 8GB through a kernel boot parameter.
 Hopefully following the same steps there will help either confirm
 the issue isn't specific to running in one particular service
 provider, or will yield some useful difference which could help
 highlight the cause.


So from the above we can conclude that the tests are running fine on hpcloud
and not on the rax provider. Since the OS (centos7) inside the VM is the same
across providers, this now boils down to some issue with the rax provider VM
+ centos7 combination.

Another data point I could gather is:
The only other centos7 job we have is check-tempest-dsvm-centos7 and it
does not run full tempest. Looking at the job's config, it only runs smoke
tests (also confirmed the same with Ian W), which I believe is only a subset
of the tests.

So that brings us to the conclusion that the cinder-glusterfs CI job
(check-tempest-dsvm-full-glusterfs-centos7) is probably the first centos7
based job running full tempest tests in upstream CI and hence is the first
to hit the issue, but on the rax provider only.

thanx,
deepak


Re: [openstack-dev] [devstack] [Cinder-GlusterFS CI] centos7 gate job abrupt failures

2015-02-24 Thread Bharat Kumar

Ran the job manually on the rax VM provided by Jeremy. (Thank you, Jeremy.)

After running 971 test cases the VM was inaccessible for 569 ticks, then it
continued... (Look at the console.log [1])

Also have a look at the dstat log. [2]

The summary is:
==
Totals
==
Ran: 1125 tests in 5835. sec.
 - Passed: 960
 - Skipped: 88
 - Expected Fail: 0
 - Unexpected Success: 0
 - Failed: 77
Sum of execute time for each test: 13603.6755 sec.
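
The totals above can be sanity-checked with a little arithmetic. The Python sketch below is illustrative only; the function and key names are assumptions. The wall-clock figure appears as "5835." in the summary, so its fractional part is unknown and 5835.0 is used as an approximation.

```python
# Figures copied from the summary above. The wall-clock time is shown as
# "5835." there, so 5835.0 is an approximation (fraction unknown).
TOTALS = {
    "passed": 960,
    "skipped": 88,
    "expected_fail": 0,
    "unexpected_success": 0,
    "failed": 77,
}
RAN = 1125
WALL_CLOCK_SEC = 5835.0
SUMMED_TEST_SEC = 13603.6755


def summarize(totals, ran, wall_clock_sec, summed_test_sec):
    """Check that the categories add up and derive two simple ratios."""
    executed = ran - totals["skipped"]  # tests that actually ran
    return {
        "consistent": sum(totals.values()) == ran,
        "fail_rate": totals["failed"] / executed,
        # Summed per-test time over wall-clock time: a rough measure of
        # how much test execution overlapped in parallel workers.
        "effective_parallelism": summed_test_sec / wall_clock_sec,
    }


if __name__ == "__main__":
    for key, value in summarize(
        TOTALS, RAN, WALL_CLOCK_SEC, SUMMED_TEST_SEC
    ).items():
        print(key, value)
```

With these inputs the categories do add up to 1125, roughly 7.4% of the executed (non-skipped) tests failed, and the summed test time is about 2.3 times the wall-clock time.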


[1] https://etherpad.openstack.org/p/rax_console.txt
[2] https://etherpad.openstack.org/p/rax_dstat.log

On 02/24/2015 07:03 PM, Deepak Shetty wrote:
 [...]


--
Warm Regards,
Bharat Kumar Kobagana
Software Engineer
OpenStack Storage – RedHat India
Mobile - +91 9949278005



Re: [openstack-dev] [devstack] [Cinder-GlusterFS CI] centos7 gate job abrupt failures

2015-02-24 Thread Deepak Shetty
FWIW, we tried to run our job in a rax provider VM (provided by ianw from
his personal account) and we ran the tempest tests twice, but the OOM did
not re-create. Of the 2 runs, one used the same PYTHONHASHSEED as we had in
one of the failed runs, still no OOM.

Jeremy graciously agreed to provide us 2 VMs, one each from the rax and
hpcloud providers, to see if the provider platform has anything to do with it.

So we plan to run again with the VMs given by Jeremy, after which I will
send the next update here.

thanx,
deepak


On Tue, Feb 24, 2015 at 4:50 AM, Jeremy Stanley fu...@yuggoth.org wrote:

 Due to an image setup bug (I have a fix proposed currently), I was
 able to rerun this on a VM in HPCloud with 30GB memory and it
 completed in about an hour with a couple of tempest tests failing.
 Logs at: http://fungi.yuggoth.org/tmp/logs3.tar

 Rerunning again on another 8GB Rackspace VM with the job timeout
 increased to 5 hours, I was able to recreate the network
 connectivity issues exhibited previously. The job itself seems to
 have run for roughly 3 hours while failing 15 tests, and the worker
 was mostly unreachable for a while at the end (I don't know exactly
 how long) until around the time it completed. The OOM condition is
 present this time too according to the logs, occurring right near
 the end of the job. Collected logs are available at:
 http://fungi.yuggoth.org/tmp/logs4.tar

 Given the comparison between these two runs, I suspect this is
 either caused by memory constraints or block device I/O performance
 differences (or perhaps an unhappy combination of the two).
 Hopefully a close review of the logs will indicate which.
 --
 Jeremy Stanley



Re: [openstack-dev] [devstack] [Cinder-GlusterFS CI] centos7 gate job abrupt failures

2015-02-24 Thread Daniel P. Berrange
On Fri, Feb 20, 2015 at 10:49:29AM -0800, Joe Gordon wrote:
 On Fri, Feb 20, 2015 at 7:29 AM, Deepak Shetty dpkshe...@gmail.com wrote:
 
  Hi Jeremy,
Couldn't find anything strong in the logs to back the reason for OOM.
  At the time OOM happens, mysqld and java processes have the most RAM hence
  OOM selects mysqld (4.7G) to be killed.
 
  From a glusterfs backend perspective, i haven't found anything suspicious,
  and we don't have the logs of glusterfs (which is typically in
  /var/log/glusterfs) so can't delve inside glusterfs too much :(
 
  BharatK (in CC) also tried to re-create the issue in local VM setup, but
  it hasn't yet!
 
  Having said that, *we do know* that we started seeing this issue after we
  enabled the nova-assisted-snapshot tests (by changing nova's policy.json
  to enable non-admin to create hyp-assisted snaps). We think that enabling
  online snaps might have added to the number of tests and memory load;
  that's the only clue we have as of now!
 
 
 It looks like OOM killer hit while qemu was busy and during
 a ServerRescueTest. Maybe libvirt logs would be useful as well?
 
 And I don't see any tempest tests calling assisted-volume-snapshots
 
 Also this looks odd: Feb 19 18:47:16
 devstack-centos7-rax-iad-916633.slave.openstack.org libvirtd[3753]: missing
 __com.redhat_reason in disk io error event

So that specific error message is harmless - the __com.redhat_reason field
is nothing important from OpenStack's POV.

However, it is interesting that QEMU is seeing an I/O error in the first
place. This occurs when you have a grow-on-demand file and the underlying
storage is full, so it is unable to allocate more blocks to cope with a guest
write. It can also occur if the underlying storage has a fatal I/O problem,
e.g. a dead sector on a hard disk, or some equivalent.

IOW, I'd not expect to see any I/O errors raised from OpenStack in a normal
scenario. So this is something to consider investigating.
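
As a starting point for that investigation, the "storage is full" case could be checked roughly as follows. This is a minimal Python sketch under stated assumptions: it only makes sense for a raw sparse file on a Unix-like host (for qcow2 the virtual size lives in the image header, so the file's apparent size is not the guest-visible size), it relies on st_blocks being counted in 512-byte units as on Linux, and the function names are hypothetical.

```python
import os
import shutil
import tempfile


def growth_headroom(image_path):
    """Compare a sparse file's apparent size with its allocated blocks
    and with the free space on the filesystem that holds it."""
    st = os.stat(image_path)
    allocated = st.st_blocks * 512  # 512-byte units on Linux
    unallocated = max(st.st_size - allocated, 0)
    parent = os.path.dirname(os.path.abspath(image_path))
    free = shutil.disk_usage(parent).free
    return {
        "apparent_bytes": st.st_size,
        "allocated_bytes": allocated,
        "unallocated_bytes": unallocated,
        "fs_free_bytes": free,
        # False means guest writes could hit "no space" before the file
        # is fully allocated.
        "can_fully_allocate": free >= unallocated,
    }


def demo():
    """Create a 1 MiB sparse file in a temp dir and report on it."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "disk.img")
        with open(path, "wb") as f:
            f.truncate(1 << 20)  # sets the size without writing data
        return growth_headroom(path)


if __name__ == "__main__":
    print(demo())
```

This does not cover the second cause mentioned above (a fatal I/O problem in the underlying storage); that would need the host's kernel log.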

Regards,
Daniel
-- 
|: http://berrange.com  -o-http://www.flickr.com/photos/dberrange/ :|
|: http://libvirt.org  -o- http://virt-manager.org :|
|: http://autobuild.org   -o- http://search.cpan.org/~danberr/ :|
|: http://entangle-photo.org   -o-   http://live.gnome.org/gtk-vnc :|



Re: [openstack-dev] [devstack] [Cinder-GlusterFS CI] centos7 gate job abrupt failures

2015-02-24 Thread Jeremy Stanley
On 2015-02-25 01:02:07 +0530 (+0530), Bharat Kumar wrote:
[...]
 After running 971 test cases VM inaccessible for 569 ticks
[...]

Glad you're able to reproduce it. For the record that is running
their 8GB performance flavor with a CentOS 7 PVHVM base image. The
steps to recreate are http://paste.openstack.org/show/181303/ as
discussed in IRC (for the sake of others following along). I've held
a similar worker in HPCloud (15.126.235.20) which is a 30GB flavor
artificially limited to 8GB through a kernel boot parameter.
Hopefully following the same steps there will help either confirm
the issue isn't specific to running in one particular service
provider, or will yield some useful difference which could help
highlight the cause.

Either way, once 104.239.136.99 and 15.126.235.20 are no longer
needed, please let one of the infrastructure root admins know to
delete them.
-- 
Jeremy Stanley



Re: [openstack-dev] [devstack] [Cinder-GlusterFS CI] centos7 gate job abrupt failures

2015-02-20 Thread Deepak Shetty
On Feb 21, 2015 12:20 AM, Jeremy Stanley fu...@yuggoth.org wrote:

 On 2015-02-20 16:29:31 +0100 (+0100), Deepak Shetty wrote:
  Couldn't find anything strong in the logs to back the reason for
  OOM. At the time OOM happens, mysqld and java processes have the
  most RAM hence OOM selects mysqld (4.7G) to be killed.
 [...]

 Today I reran it after you rolled back some additional tests, and it
 runs for about 117 minutes before the OOM killer shoots nova-compute
 in the head. At your request I've added /var/log/glusterfs into the
 tarball this time: http://fungi.yuggoth.org/tmp/logs2.tar

Thanks Jeremy, can we get SSH access to one of these environments to debug?

Thanks
Deepak

 --
 Jeremy Stanley



Re: [openstack-dev] [devstack] [Cinder-GlusterFS CI] centos7 gate job abrupt failures

2015-02-20 Thread Deepak Shetty
On Feb 21, 2015 12:26 AM, Joe Gordon joe.gord...@gmail.com wrote:



 On Fri, Feb 20, 2015 at 7:29 AM, Deepak Shetty dpkshe...@gmail.com
wrote:

 Hi Jeremy,
   Couldn't find anything strong in the logs to back the reason for OOM.
 At the time OOM happens, mysqld and java processes have the most RAM
hence OOM selects mysqld (4.7G) to be killed.

 From a glusterfs backend perspective, i haven't found anything
suspicious, and we don't have the logs of glusterfs (which is typically in
/var/log/glusterfs) so can't delve inside glusterfs too much :(

 BharatK (in CC) also tried to re-create the issue in local VM setup, but
it hasn't yet!

 Having said that, we do know that we started seeing this issue after we
enabled the nova-assisted-snapshot tests (by changing nova's policy.json
to enable non-admin to create hyp-assisted snaps). We think that enabling
online snaps might have added to the number of tests and memory load;
that's the only clue we have as of now!


 It looks like OOM killer hit while qemu was busy and during
a ServerRescueTest. Maybe libvirt logs would be useful as well?

Thanks for the data point, will look at this test to understand more what's
happening


 And I don't see any tempest tests calling assisted-volume-snapshots

Maybe it hasn't reached it yet.

Thanks
Deepak


 Also this looks odd: Feb 19 18:47:16
devstack-centos7-rax-iad-916633.slave.openstack.org libvirtd[3753]: missing
__com.redhat_reason in disk io error event



 So :

   1) BharatK  has merged the patch (
https://review.openstack.org/#/c/157707/ ) to revert the policy.json in the
glusterfs job. So no more nova-assisted-snap tests.

   2) We also are increasing the timeout of our job in patch (
https://review.openstack.org/#/c/157835/1 ) so that we can get a full run
without timeouts to do a good analysis of the logs (logs are not posted if
the job times out)

 Can you please re-enable our job, so that we can confirm that disabling
online snap TCs is helping the issue, which if it does, can help us narrow
down the issue.

 We also plan to monitor and debug over the weekend, hence having the job
enabled can help us a lot.

 thanx,
 deepak


 On Thu, Feb 19, 2015 at 10:37 PM, Jeremy Stanley fu...@yuggoth.org
wrote:

 On 2015-02-19 17:03:49 +0100 (+0100), Deepak Shetty wrote:
 [...]
  For some reason we are seeing the centos7 glusterfs CI job getting
  aborted/ killed either by Java exception or the build getting
  aborted due to timeout.
 [...]
  Hoping to root cause this soon and get the cinder-glusterfs CI job
  back online soon.

 I manually reran the same commands this job runs on an identical
 virtual machine and was able to reproduce some substantial
 weirdness.

 I temporarily lost remote access to the VM around 108 minutes into
 running the job (~17:50 in the logs) and the out of band console
 also became unresponsive to carriage returns. The machine's IP
 address still responded to ICMP ping, but attempts to open new TCP
 sockets to the SSH service never got a protocol version banner back.
 After about 10 minutes of that I went out to lunch but left
 everything untouched. To my excitement it was up and responding
 again when I returned.

 It appears from the logs that it runs well past the 120-minute mark
 where devstack-gate tries to kill the gate hook for its configured
 timeout. Somewhere around 165 minutes in (18:47) you can see the
 kernel out-of-memory killer starts to kick in and kill httpd and
 mysqld processes according to the syslog. Hopefully this is enough
 additional detail to get you a start at finding the root cause so
 that we can reenable your job. Let me know if there's anything else
 you need for this.

 [1] http://fungi.yuggoth.org/tmp/logs.tar
 --
 Jeremy Stanley




Re: [openstack-dev] [devstack] [Cinder-GlusterFS CI] centos7 gate job abrupt failures

2015-02-20 Thread Deepak Shetty
Hi Jeremy,
  Couldn't find anything strong in the logs to back the reason for OOM.
At the time OOM happens, mysqld and java processes have the most RAM hence
OOM selects mysqld (4.7G) to be killed.

From a glusterfs backend perspective, I haven't found anything suspicious,
and we don't have the glusterfs logs (which are typically in
/var/log/glusterfs), so we can't delve inside glusterfs too much :(

BharatK (in CC) also tried to re-create the issue in a local VM setup, but
hasn't been able to yet!

Having said that, we do know that we started seeing this issue after we
enabled the nova-assisted-snapshot tests (by changing nova's policy.json
to enable non-admin to create hyp-assisted snaps). We think that enabling
online snaps might have added to the number of tests and memory load;
that's the only clue we have as of now!

So :

  1) BharatK  has merged the patch (
https://review.openstack.org/#/c/157707/ ) to revert the policy.json in the
glusterfs job. So no more nova-assisted-snap tests.

  2) We also are increasing the timeout of our job in patch (
https://review.openstack.org/#/c/157835/1 ) so that we can get a full run
without timeouts to do a good analysis of the logs (logs are not posted if
the job times out)

Can you please re-enable our job so that we can confirm whether disabling
the online snapshot test cases helps? If it does, that will help us narrow
down the issue.

We also plan to monitor and debug over the weekend, hence having the job
enabled can help us a lot.

thanx,
deepak


On Thu, Feb 19, 2015 at 10:37 PM, Jeremy Stanley fu...@yuggoth.org wrote:

 On 2015-02-19 17:03:49 +0100 (+0100), Deepak Shetty wrote:
 [...]
  For some reason we are seeing the centos7 glusterfs CI job getting
  aborted/ killed either by Java exception or the build getting
  aborted due to timeout.
 [...]
  Hoping to root cause this soon and get the cinder-glusterfs CI job
  back online soon.

 I manually reran the same commands this job runs on an identical
 virtual machine and was able to reproduce some substantial
 weirdness.

 I temporarily lost remote access to the VM around 108 minutes into
 running the job (~17:50 in the logs) and the out of band console
 also became unresponsive to carriage returns. The machine's IP
 address still responded to ICMP ping, but attempts to open new TCP
 sockets to the SSH service never got a protocol version banner back.
 After about 10 minutes of that I went out to lunch but left
 everything untouched. To my excitement it was up and responding
 again when I returned.

 It appears from the logs that it runs well past the 120-minute mark
 where devstack-gate tries to kill the gate hook for its configured
 timeout. Somewhere around 165 minutes in (18:47) you can see the
 kernel out-of-memory killer starts to kick in and kill httpd and
 mysqld processes according to the syslog. Hopefully this is enough
 additional detail to get you a start at finding the root cause so
 that we can reenable your job. Let me know if there's anything else
 you need for this.

 [1] http://fungi.yuggoth.org/tmp/logs.tar
 --
 Jeremy Stanley



Re: [openstack-dev] [devstack] [Cinder-GlusterFS CI] centos7 gate job abrupt failures

2015-02-20 Thread Joe Gordon
On Fri, Feb 20, 2015 at 7:29 AM, Deepak Shetty dpkshe...@gmail.com wrote:

 Hi Jeremy,
   Couldn't find anything strong in the logs to back the reason for OOM.
 At the time OOM happens, mysqld and java processes have the most RAM hence
 OOM selects mysqld (4.7G) to be killed.

 From a glusterfs backend perspective, i haven't found anything suspicious,
 and we don't have the logs of glusterfs (which is typically in
 /var/log/glusterfs) so can't delve inside glusterfs too much :(

 BharatK (in CC) also tried to re-create the issue in local VM setup, but
 it hasn't yet!

 Having said that, we do know that we started seeing this issue after we
 enabled the nova-assisted-snapshot tests (by changing nova's policy.json
 to enable non-admin to create hyp-assisted snaps). We think that enabling
 online snaps might have added to the number of tests and memory load;
 that's the only clue we have as of now!


It looks like OOM killer hit while qemu was busy and during
a ServerRescueTest. Maybe libvirt logs would be useful as well?

And I don't see any tempest tests calling assisted-volume-snapshots

Also this looks odd: Feb 19 18:47:16
devstack-centos7-rax-iad-916633.slave.openstack.org libvirtd[3753]: missing
__com.redhat_reason in disk io error event



 So :

   1) BharatK  has merged the patch (
 https://review.openstack.org/#/c/157707/ ) to revert the policy.json in
 the glusterfs job. So no more nova-assisted-snap tests.

   2) We also are increasing the timeout of our job in patch (
 https://review.openstack.org/#/c/157835/1 ) so that we can get a full run
 without timeouts to do a good analysis of the logs (logs are not posted if
 the job times out)

 Can you please re-enable our job, so that we can confirm that disabling
 online snap TCs is helping the issue, which if it does, can help us narrow
 down the issue.

 We also plan to monitor and debug over the weekend, hence having the job
 enabled can help us a lot.

 thanx,
 deepak


 On Thu, Feb 19, 2015 at 10:37 PM, Jeremy Stanley fu...@yuggoth.org
 wrote:

 On 2015-02-19 17:03:49 +0100 (+0100), Deepak Shetty wrote:
 [...]
  For some reason we are seeing the centos7 glusterfs CI job getting
  aborted/ killed either by Java exception or the build getting
  aborted due to timeout.
 [...]
  Hoping to root cause this soon and get the cinder-glusterfs CI job
  back online soon.

 I manually reran the same commands this job runs on an identical
 virtual machine and was able to reproduce some substantial
 weirdness.

 I temporarily lost remote access to the VM around 108 minutes into
 running the job (~17:50 in the logs) and the out of band console
 also became unresponsive to carriage returns. The machine's IP
 address still responded to ICMP ping, but attempts to open new TCP
 sockets to the SSH service never got a protocol version banner back.
 After about 10 minutes of that I went out to lunch but left
 everything untouched. To my excitement it was up and responding
 again when I returned.

 It appears from the logs that it runs well past the 120-minute mark
 where devstack-gate tries to kill the gate hook for its configured
 timeout. Somewhere around 165 minutes in (18:47) you can see the
 kernel out-of-memory killer starts to kick in and kill httpd and
 mysqld processes according to the syslog. Hopefully this is enough
 additional detail to get you a start at finding the root cause so
 that we can reenable your job. Let me know if there's anything else
 you need for this.

 [1] http://fungi.yuggoth.org/tmp/logs.tar
 --
 Jeremy Stanley



Re: [openstack-dev] [devstack] [Cinder-GlusterFS CI] centos7 gate job abrupt failures

2015-02-20 Thread Jeremy Stanley
On 2015-02-20 16:29:31 +0100 (+0100), Deepak Shetty wrote:
 Couldn't find anything strong in the logs to back the reason for
 OOM. At the time OOM happens, mysqld and java processes have the
 most RAM hence OOM selects mysqld (4.7G) to be killed.
[...]

Today I reran it after you rolled back some additional tests, and it
runs for about 117 minutes before the OOM killer shoots nova-compute
in the head. At your request I've added /var/log/glusterfs into the
tarball this time: http://fungi.yuggoth.org/tmp/logs2.tar
-- 
Jeremy Stanley



Re: [openstack-dev] [devstack] [Cinder-GlusterFS CI] centos7 gate job abrupt failures

2015-02-19 Thread Jeremy Stanley
On 2015-02-19 17:03:49 +0100 (+0100), Deepak Shetty wrote:
[...]
 For some reason we are seeing the centos7 glusterfs CI job getting
 aborted/ killed either by Java exception or the build getting
 aborted due to timeout.
[...]
 Hoping to root cause this soon and get the cinder-glusterfs CI job
 back online soon.

I manually reran the same commands this job runs on an identical
virtual machine and was able to reproduce some substantial
weirdness.

I temporarily lost remote access to the VM around 108 minutes into
running the job (~17:50 in the logs) and the out of band console
also became unresponsive to carriage returns. The machine's IP
address still responded to ICMP ping, but attempts to open new TCP
sockets to the SSH service never got a protocol version banner back.
After about 10 minutes of that I went out to lunch but left
everything untouched. To my excitement it was up and responding
again when I returned.

It appears from the logs that it runs well past the 120-minute mark
where devstack-gate tries to kill the gate hook for its configured
timeout. Somewhere around 165 minutes in (18:47) you can see the
kernel out-of-memory killer starts to kick in and kill httpd and
mysqld processes according to the syslog. Hopefully this is enough
additional detail to get you a start at finding the root cause so
that we can reenable your job. Let me know if there's anything else
you need for this.
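
For anyone wanting to watch memory growth before the OOM killer fires, a
minimal Python sketch follows (not something used in this thread; the helper
names parse_status and top_rss are hypothetical). It assumes a Linux /proc
filesystem and ranks processes by the VmRSS field in /proc/<pid>/status.
Sampling it periodically during a run would show which processes are growing.

```python
import os
import re

# Fields of /proc/<pid>/status that we care about.
_NAME = re.compile(r"^Name:\s+(.+)$", re.MULTILINE)
_VMRSS = re.compile(r"^VmRSS:\s+(\d+)\s+kB", re.MULTILINE)


def parse_status(text):
    """Return (name, rss_kb) from the text of a status file.

    Returns None when either field is missing; kernel threads, for
    example, have no VmRSS line.
    """
    name = _NAME.search(text)
    rss = _VMRSS.search(text)
    if name is None or rss is None:
        return None
    return name.group(1).strip(), int(rss.group(1))


def top_rss(n=5, proc="/proc"):
    """Return up to n (rss_kb, pid, name) tuples, largest resident set first."""
    rows = []
    for entry in os.listdir(proc):
        if not entry.isdigit():
            continue  # not a process directory
        try:
            with open(os.path.join(proc, entry, "status")) as f:
                parsed = parse_status(f.read())
        except OSError:
            continue  # process exited while we were scanning
        if parsed is not None:
            name, rss_kb = parsed
            rows.append((rss_kb, int(entry), name))
    rows.sort(reverse=True)
    return rows[:n]


if __name__ == "__main__":
    for rss_kb, pid, name in top_rss():
        print("%10d kB  pid %-7d %s" % (rss_kb, pid, name))
```

Running this from a loop with a timestamp and appending the output to a file
on the test node would give a rough memory timeline to line up against the
syslog entries.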

[1] http://fungi.yuggoth.org/tmp/logs.tar
-- 
Jeremy Stanley
