Re: [openstack-dev] [devstack] [Cinder-GlusterFS CI] centos7 gate job abrupt failures
Update: the Cinder-GlusterFS CI job (ubuntu based) was added as experimental (non-voting) to the cinder project [1]. It is running successfully without any issue so far [2], [3]. We will monitor it for a few days and, if it continues to run fine, we will propose a patch to make it check (voting).

[1]: https://review.openstack.org/160664
[2]: https://jenkins07.openstack.org/job/gate-tempest-dsvm-full-glusterfs/
[3]: https://jenkins02.openstack.org/job/gate-tempest-dsvm-full-glusterfs/

thanx, deepak

On Fri, Feb 27, 2015 at 10:47 PM, Deepak Shetty dpkshe...@gmail.com wrote: [...]

__ OpenStack Development Mailing List (not for usage questions) Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
Re: [openstack-dev] [devstack] [Cinder-GlusterFS CI] centos7 gate job abrupt failures
On Fri, Feb 27, 2015 at 4:02 PM, Deepak Shetty dpkshe...@gmail.com wrote:
[...]
> Quick Update: Cinder-GlusterFS CI job on ubuntu was added (https://review.openstack.org/159217). We ran it 3 times against our stackforge repo patch @ https://review.openstack.org/159711 and it works fine (2 testcase failures, which are expected and we're working towards fixing them). [...] Of the 3 jobs, 1 was scheduled on rax and 2 on hpcloud, so it's working nicely across the different cloud providers.

Clarkb, Fungi: given that the ubuntu job is stable, I would like to propose adding it as experimental to the openstack cinder project while we work on fixing the 2 failed test cases in parallel.

thanx, deepak
Re: [openstack-dev] [devstack] [Cinder-GlusterFS CI] centos7 gate job abrupt failures
On Wed, Feb 25, 2015 at 11:48 PM, Deepak Shetty dpkshe...@gmail.com wrote: [...]

Ran tempest configured with the default backend (LVM) and was able to recreate the OOM issue, so running tempest without gluster against a fresh VM reliably recreates the OOM. Snip below from syslog:

Feb 25 16:58:37 devstack-centos7-rax-dfw-979654 kernel: glance-api invoked oom-killer: gfp_mask=0x201da, order=0, oom_score_adj=0

Had a discussion with clarkb on IRC and given that F20 is discontinued, F21 has issues with tempest (under debug by ianw) and centos7 also has issues on rax (as evident from this thread), the only option left is to go with an ubuntu based CI job, which BharatK is working on now.

Quick Update: Cinder-GlusterFS CI job on ubuntu was added (https://review.openstack.org/159217). We ran it 3 times against our stackforge repo patch @ https://review.openstack.org/159711 and it works fine (2 testcase failures, which are expected and we're working towards fixing them). For the logs of the 3 experimental runs, look @ http://logs.openstack.org/11/159711/1/experimental/gate-tempest-dsvm-full-glusterfs/

Of the 3 jobs, 1 was scheduled on rax and 2 on hpcloud, so it's working nicely across the different cloud providers.

thanx, deepak
Re: [openstack-dev] [devstack] [Cinder-GlusterFS CI] centos7 gate job abrupt failures
On Thu, Feb 26, 2015, at 03:03 AM, Deepak Shetty wrote:
[...]
> You can delete these VMs, will request if needed again

I have marked these VMs for deletion and they should be gone shortly. The new experimental job is in place, so you can start testing that against your plugin with `check experimental` comments.

Clark
Re: [openstack-dev] [devstack] [Cinder-GlusterFS CI] centos7 gate job abrupt failures
On Wed, Feb 25, 2015 at 6:11 AM, Jeremy Stanley fu...@yuggoth.org wrote:
[...]
> Either way, once 104.239.136.99 and 15.126.235.20 are no longer needed, please let one of the infrastructure root admins know to delete them.

You can delete these VMs, will request if needed again.

thanx, deepak
Re: [openstack-dev] [devstack] [Cinder-GlusterFS CI] centos7 gate job abrupt failures
On 2015-02-25 17:02:34 +0530 (+0530), Deepak Shetty wrote:
[...]
> Run 2) We removed glusterfs backend, so Cinder was configured with the default storage backend i.e. LVM. We re-created the OOM here too. So that proves that glusterfs doesn't cause it, as it's happening without glusterfs too.

Well, if you re-ran the job on the same VM then the second result is potentially contaminated. Luckily this hypothesis can be confirmed by running the second test on a fresh VM in Rackspace.

> The VM (104.239.136.99) is now in such a bad shape that existing ssh sessions are no longer responding for a long long time now, tho' ping works. So need someone to help reboot/restart the VM so that we can collect the logs for records. Couldn't find anyone during apac TZ to get it rebooted.
[...]

According to novaclient that instance was in a shutoff state, and so I had to `nova reboot --hard` to get it running. Looks like it's back up and reachable again now.

> So from the above we can conclude that the tests are running fine on hpcloud and not on rax provider. Since the OS (centos7) inside the VM is the same across providers, this now boils down to some issue with the rax provider VM + centos7 combination.

This certainly seems possible.

> Another data point I could gather is: the only other centos7 job we have is check-tempest-dsvm-centos7 and it does not run full tempest; looking at the job's config it only runs smoke tests (also confirmed the same with Ian W), which I believe is a subset of tests only.

Correct, so if we confirm that we can't successfully run tempest full on CentOS 7 in both of our providers yet, we should probably think hard about the implications on yesterday's discussion as to whether to set the smoke version gating on devstack and devstack-gate changes.

> So that brings to the conclusion that the cinder-glusterfs CI job (check-tempest-dsvm-full-glusterfs-centos7) is probably the first centos7 based job running full tempest tests in upstream CI and hence is the first to hit the issue, but on rax provider only.

Entirely likely. As I mentioned last week, we don't yet have any voting/gating jobs running on the platform as far as I can tell, so it's still very much in an experimental stage.
-- Jeremy Stanley
Re: [openstack-dev] [devstack] [Cinder-GlusterFS CI] centos7 gate job abrupt failures
On Wed, Feb 25, 2015 at 6:34 PM, Jeremy Stanley fu...@yuggoth.org wrote:
> On 2015-02-25 17:02:34 +0530 (+0530), Deepak Shetty wrote:
> [...]
>> Run 2) We removed glusterfs backend, so Cinder was configured with the default storage backend i.e. LVM. We re-created the OOM here too. [...]
>
> Well, if you re-ran the job on the same VM then the second result is potentially contaminated. Luckily this hypothesis can be confirmed by running the second test on a fresh VM in Rackspace.

Maybe true, but we did the same on the hpcloud provider VM too and both times it ran successfully with glusterfs as the cinder backend. Also, before starting the 2nd run we did unstack, saw that free memory went back to 5G+, and then re-invoked your script. I believe the contamination could result in some additional testcase failures (which we did see) but shouldn't be related to whether the system can OOM or not, since that's a runtime thing.

I see that the VM is up again. We will execute the 2nd run afresh now and update here.

> According to novaclient that instance was in a shutoff state, and so I had to `nova reboot --hard` to get it running. Looks like it's back up and reachable again now.

Cool, thanks!

> This certainly seems possible.
> [...]
> Correct, so if we confirm that we can't successfully run tempest full on CentOS 7 in both of our providers yet, we should probably think hard about the implications on yesterday's discussion as to whether to set the smoke version gating on devstack and devstack-gate changes.
> [...]
> Entirely likely. As I mentioned last week, we don't yet have any voting/gating jobs running on the platform as far as I can tell, so it's still very much in an experimental stage.

So is there a way for a job to ask for hpcloud affinity, since that's where our job ran well (faster and only 2 failures, which were expected)? I am not sure how easy and time consuming it would be to root cause why centos7 + the rax provider is causing the OOM. Alternatively, do you recommend using some other OS as the base for our job - F20, F21 or ubuntu? I assume there are other jobs on the rax provider that run full tempest on Fedora or Ubuntu and don't OOM; would you know?

thanx, deepak
Re: [openstack-dev] [devstack] [Cinder-GlusterFS CI] centos7 gate job abrupt failures
On Wed, Feb 25, 2015 at 6:11 AM, Jeremy Stanley fu...@yuggoth.org wrote:
> Glad you're able to reproduce it. For the record that is running their 8GB performance flavor with a CentOS 7 PVHVM base image. The steps to recreate are http://paste.openstack.org/show/181303/ as discussed in IRC (for the sake of others following along).

So we had 2 runs in total in the rax provider VM and below are the results:

Run 1) It failed and re-created the OOM. The setup had glusterfs as a storage backend for Cinder.

[deepakcs@deepakcs r6-jeremy-rax-vm]$ grep oom-killer run1-w-gluster/logs/syslog.txt
Feb 24 18:41:08 devstack-centos7-rax-dfw-979654.slave.openstack.org kernel: mysqld invoked oom-killer: gfp_mask=0x201da, order=0, oom_score_adj=0

Run 2) We *removed glusterfs backend*, so Cinder was configured with the default storage backend i.e. LVM. *We re-created the OOM here too*. So that proves that glusterfs doesn't cause it, as it's happening without glusterfs too.

The VM (104.239.136.99) is now in such a bad shape that existing ssh sessions are no longer responding for a long, long time now, tho' ping works. So we need someone to help reboot/restart the VM so that we can collect the logs for the record. Couldn't find anyone during APAC TZ to get it rebooted.

We managed to get the below grep to work after a long time from another terminal, to prove that the OOM did happen for run 2:

bash-4.2$ sudo cat /var/log/messages | grep oom-killer
Feb 25 08:53:16 devstack-centos7-rax-dfw-979654 kernel: ntpd invoked oom-killer: gfp_mask=0x201da, order=0, oom_score_adj=0
Feb 25 09:03:35 devstack-centos7-rax-dfw-979654 kernel: beam.smp invoked oom-killer: gfp_mask=0x201da, order=0, oom_score_adj=0
Feb 25 09:57:28 devstack-centos7-rax-dfw-979654 kernel: mysqld invoked oom-killer: gfp_mask=0x201da, order=0, oom_score_adj=0
Feb 25 10:40:38 devstack-centos7-rax-dfw-979654 kernel: mysqld invoked oom-killer: gfp_mask=0x201da, order=0, oom_score_adj=0

> I've held a similar worker in HPCloud (15.126.235.20) which is a 30GB flavor artificially limited to 8GB through a kernel boot parameter.

We ran 2 runs in total in the hpcloud provider VM (and this time it was set up correctly with 8G RAM, as evident from /proc/meminfo as well as the dstat output):

Run 1) It was successful. The setup had glusterfs as a storage backend for Cinder. Only 2 testcases failed, and they were expected. No OOM happened.

[deepakcs@deepakcs r7-jeremy-hpcloud-vm]$ grep oom-killer run1-w-gluster/logs/syslog.txt
[deepakcs@deepakcs r7-jeremy-hpcloud-vm]$

Run 2) Since run 1 went fine, we enabled the tempest volume backup testcases too and ran again. It was successful and no OOM happened.

[deepakcs@deepakcs r7-jeremy-hpcloud-vm]$ grep oom-killer run2-w-gluster/logs/syslog.txt
[deepakcs@deepakcs r7-jeremy-hpcloud-vm]$

> Hopefully following the same steps there will help either confirm the issue isn't specific to running in one particular service provider, or will yield some useful difference which could help highlight the cause.

So from the above we can conclude that the tests are running fine on hpcloud and not on the rax provider. Since the OS (centos7) inside the VM is the same across providers, this now boils down to some issue with the rax provider VM + centos7 combination.

Another data point I could gather is: the only other centos7 job we have is check-tempest-dsvm-centos7 and it does not run full tempest; looking at the job's config it only runs smoke tests (also confirmed the same with Ian W), which I believe is a subset of tests only.

So that brings us to the conclusion that the cinder-glusterfs CI job (check-tempest-dsvm-full-glusterfs-centos7) is probably the first centos7 based job running full tempest tests in upstream CI and hence is the first to hit the issue, but on the rax provider only.

thanx, deepak
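The grep checks above can be sketched as a small script. The sample log lines below are copied from the runs quoted in this thread; the path is illustrative, not the CI's actual log location:

```shell
# Build a sample syslog from the entries quoted above (illustrative path).
SYSLOG=/tmp/sample-syslog.txt
cat > "$SYSLOG" <<'EOF'
Feb 25 08:53:16 devstack-centos7-rax-dfw-979654 kernel: ntpd invoked oom-killer: gfp_mask=0x201da, order=0, oom_score_adj=0
Feb 25 09:57:28 devstack-centos7-rax-dfw-979654 kernel: mysqld invoked oom-killer: gfp_mask=0x201da, order=0, oom_score_adj=0
EOF

# Count OOM events...
grep -c 'oom-killer' "$SYSLOG"
# ...and list which processes triggered them. Field 6 is the process name
# in "kernel: <proc> invoked oom-killer: ...".
grep 'invoked oom-killer' "$SYSLOG" | awk '{print $6}' | sort | uniq -c
```

The same two greps work against run1-w-gluster/logs/syslog.txt or /var/log/messages as shown in the runs above.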
Re: [openstack-dev] [devstack] [Cinder-GlusterFS CI] centos7 gate job abrupt failures
Ran the job manually on the rax VM provided by Jeremy. (Thank you, Jeremy.)

After running 971 test cases the VM was inaccessible for 569 ticks, then continues... (look at the console log [1]). Also have a look at the dstat log [2].

The summary is:

== Totals ==
Ran: 1125 tests in 5835. sec.
- Passed: 960
- Skipped: 88
- Expected Fail: 0
- Unexpected Success: 0
- Failed: 77
Sum of execute time for each test: 13603.6755 sec.

[1] https://etherpad.openstack.org/p/rax_console.txt
[2] https://etherpad.openstack.org/p/rax_dstat.log

On 02/24/2015 07:03 PM, Deepak Shetty wrote: [...]

--
Warm Regards,
Bharat Kumar Kobagana
Software Engineer
OpenStack Storage – RedHat India
Mobile - +91 9949278005
Re: [openstack-dev] [devstack] [Cinder-GlusterFS CI] centos7 gate job abrupt failures
FWIW, we tried to run our job in a rax provider VM (provided by ianw from his personal account) and we ran the tempest tests twice, but the OOM did not re-create. Of the 2 runs, one run used the same PYTHONHASHSEED as we had in one of the failed runs; still no OOM.

Jeremy graciously agreed to provide us 2 VMs, one each from the rax and hpcloud providers, to see if the provider platform has anything to do with it. So we plan to run again with the VMs given by Jeremy, post which I will send the next update here.

thanx, deepak

On Tue, Feb 24, 2015 at 4:50 AM, Jeremy Stanley fu...@yuggoth.org wrote:
> Due to an image setup bug (I have a fix proposed currently), I was able to rerun this on a VM in HPCloud with 30GB memory and it completed in about an hour with a couple of tempest tests failing. Logs at: http://fungi.yuggoth.org/tmp/logs3.tar
>
> Rerunning again on another 8GB Rackspace VM with the job timeout increased to 5 hours, I was able to recreate the network connectivity issues exhibited previously. The job itself seems to have run for roughly 3 hours while failing 15 tests, and the worker was mostly unreachable for a while at the end (I don't know exactly how long) until around the time it completed. The OOM condition is present this time too according to the logs, occurring right near the end of the job.
>
> Collected logs are available at: http://fungi.yuggoth.org/tmp/logs4.tar
>
> Given the comparison between these two runs, I suspect this is either caused by memory constraints or block device I/O performance differences (or perhaps an unhappy combination of the two). Hopefully a close review of the logs will indicate which.
>
> -- Jeremy Stanley
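Since one retry above reused the failed run's PYTHONHASHSEED, here is a minimal sketch of why pinning that variable makes hash-dependent behavior (such as randomized test ordering) repeatable. The seed value is a placeholder, not the one from the failed job:

```shell
# Placeholder seed; in the gate the actual seed appears in the job's
# console log and can be exported before re-running the tests.
export PYTHONHASHSEED=1234

# With the seed pinned, string hashes (and anything ordered by them)
# repeat across interpreter runs instead of being randomized.
a=$(python3 -c "print(hash('tempest.api.volume'))")
b=$(python3 -c "print(hash('tempest.api.volume'))")
[ "$a" = "$b" ] && echo "hash ordering is reproducible"
```

Without the export, each python3 invocation picks a fresh random seed and the two hashes would almost certainly differ.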
Re: [openstack-dev] [devstack] [Cinder-GlusterFS CI] centos7 gate job abrupt failures
On Fri, Feb 20, 2015 at 10:49:29AM -0800, Joe Gordon wrote:
> On Fri, Feb 20, 2015 at 7:29 AM, Deepak Shetty dpkshe...@gmail.com wrote:
>> Hi Jeremy, Couldn't find anything strong in the logs to back the reason for OOM. [...]
>
> It looks like the OOM killer hit while qemu was busy and during a ServerRescueTest. Maybe libvirt logs would be useful as well? And I don't see any tempest tests calling assisted-volume-snapshots. Also this looks odd:
>
> Feb 19 18:47:16 devstack-centos7-rax-iad-916633.slave.openstack.org libvirtd[3753]: missing __com.redhat_reason in disk io error event

So that specific error message is harmless - the __com.redhat_reason field is nothing important from OpenStack's POV.

However, it is interesting that QEMU is seeing an I/O error in the first place. This occurs when you have a grow-on-demand file and the underlying storage is full, so it is unable to allocate more blocks to cope with a guest write. It can also occur if the underlying storage has a fatal I/O problem, e.g. a dead sector in a hard disk, or some equivalent. IOW, I'd not expect to see any I/O errors raised from OpenStack in a normal scenario, so this is something to consider investigating.

Regards, Daniel
--
|: http://berrange.com -o- http://www.flickr.com/photos/dberrange/ :|
|: http://libvirt.org -o- http://virt-manager.org :|
|: http://autobuild.org -o- http://search.cpan.org/~danberr/ :|
|: http://entangle-photo.org -o- http://live.gnome.org/gtk-vnc :|
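The grow-on-demand failure mode Daniel describes can be demonstrated with a sparse file: blocks are allocated lazily, so a write can fail with ENOSPC long after the file was created. A minimal sketch (paths are illustrative, not from the CI logs):

```shell
# A sparse file has a large apparent size but almost no allocated blocks.
F=/tmp/sparse-demo.img
truncate -s 1G "$F"

# Apparent size: the full 1 GiB the guest believes it has.
du --apparent-size --block-size=1 "$F" | cut -f1
# Allocated size: near zero until writes actually land. If the backing
# filesystem fills up first, those late allocations fail and qemu
# surfaces the failure as a guest disk I/O error.
du --block-size=1 "$F" | cut -f1
```

qcow2 images behave the same way: the virtual size can far exceed what the backing filesystem can actually provide.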
Re: [openstack-dev] [devstack] [Cinder-GlusterFS CI] centos7 gate job abrupt failures
On 2015-02-25 01:02:07 +0530 (+0530), Bharat Kumar wrote:
[...]
> After running 971 test cases VM inaccessible for 569 ticks
[...]

Glad you're able to reproduce it. For the record that is running their 8GB performance flavor with a CentOS 7 PVHVM base image. The steps to recreate are http://paste.openstack.org/show/181303/ as discussed in IRC (for the sake of others following along).

I've held a similar worker in HPCloud (15.126.235.20) which is a 30GB flavor artificially limited to 8GB through a kernel boot parameter. Hopefully following the same steps there will help either confirm the issue isn't specific to running in one particular service provider, or will yield some useful difference which could help highlight the cause.

Either way, once 104.239.136.99 and 15.126.235.20 are no longer needed, please let one of the infrastructure root admins know to delete them.
-- Jeremy Stanley
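Capping a large flavor at 8GB with a kernel boot parameter, as Jeremy describes for the held HPCloud worker, amounts to adding `mem=8G` to the kernel command line. A hedged sketch editing a scratch copy of a grub defaults file - the real change would edit /etc/default/grub, regenerate grub.cfg, and reboot, and the exact file layout varies by distro:

```shell
# Scratch copy standing in for /etc/default/grub (contents illustrative).
GRUB=/tmp/grub-default-demo
echo 'GRUB_CMDLINE_LINUX="crashkernel=auto rhgb quiet"' > "$GRUB"

# Append mem=8G inside the quoted kernel command line; the kernel will
# then ignore RAM above 8GB on the next boot.
sed -i 's/^\(GRUB_CMDLINE_LINUX=".*\)"$/\1 mem=8G"/' "$GRUB"
cat "$GRUB"
```

On the real system this would be followed by something like `grub2-mkconfig -o /boot/grub2/grub.cfg` and a reboot, after which /proc/meminfo reports roughly 8GB, as noted for the hpcloud runs above.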
Re: [openstack-dev] [devstack] [Cinder-GlusterFS CI] centos7 gate job abrupt failures
On Feb 21, 2015 12:20 AM, Jeremy Stanley fu...@yuggoth.org wrote:
> On 2015-02-20 16:29:31 +0100 (+0100), Deepak Shetty wrote:
>> Couldn't find anything strong in the logs to back the reason for OOM. At the time OOM happens, mysqld and java processes have the most RAM, hence OOM selects mysqld (4.7G) to be killed.
> [...]
>
> Today I reran it after you rolled back some additional tests, and it runs for about 117 minutes before the OOM killer shoots nova-compute in the head. At your request I've added /var/log/glusterfs into the tarball this time: http://fungi.yuggoth.org/tmp/logs2.tar

Thanks Jeremy, can we get ssh access to one of these envs to debug?

Thanks, Deepak
Re: [openstack-dev] [devstack] [Cinder-GlusterFS CI] centos7 gate job abrupt failures
On Feb 21, 2015 12:26 AM, Joe Gordon joe.gord...@gmail.com wrote: On Fri, Feb 20, 2015 at 7:29 AM, Deepak Shetty dpkshe...@gmail.com wrote: Hi Jeremy, Couldn't find anything strong in the logs to back the reason for OOM. At the time OOM happens, mysqld and java processes have the most RAM hence OOM selects mysqld (4.7G) to be killed. From a glusterfs backend perspective, I haven't found anything suspicious, and we don't have the logs of glusterfs (which are typically in /var/log/glusterfs) so can't delve inside glusterfs too much :( BharatK (in CC) also tried to re-create the issue in a local VM setup, but it hasn't reproduced yet! Having said that, we do know that we started seeing this issue after we enabled the nova-assisted-snapshot tests (by changing nova's policy.json to enable non-admin to create hyp-assisted snaps). We think that enabling online snaps might have added to the number of tests and memory load; that's the only clue we have as of now! It looks like OOM killer hit while qemu was busy and during a ServerRescueTest. Maybe libvirt logs would be useful as well? Thanks for the data point, will look at this test to understand more of what's happening. And I don't see any tempest tests calling assisted-volume-snapshots Maybe it just hasn't reached it yet. Thanks Deepak Also this looks odd: Feb 19 18:47:16 devstack-centos7-rax-iad-916633.slave.openstack.org libvirtd[3753]: missing __com.redhat_reason in disk io error event So : 1) BharatK has merged the patch ( https://review.openstack.org/#/c/157707/ ) to revert the policy.json in the glusterfs job. So no more nova-assisted-snap tests.
2) We are also increasing the timeout of our job in patch ( https://review.openstack.org/#/c/157835/1 ) so that we can get a full run without timeouts to do a good analysis of the logs (logs are not posted if the job times out). Can you please re-enable our job, so that we can confirm that disabling the online snap TCs is helping the issue, which, if it does, will help us narrow down the cause. We also plan to monitor and debug over the weekend, hence having the job enabled will help us a lot. thanx, deepak On Thu, Feb 19, 2015 at 10:37 PM, Jeremy Stanley fu...@yuggoth.org wrote: On 2015-02-19 17:03:49 +0100 (+0100), Deepak Shetty wrote: [...] For some reason we are seeing the centos7 glusterfs CI job getting aborted/killed either by Java exception or the build getting aborted due to timeout. [...] Hoping to root cause this soon and get the cinder-glusterfs CI job back online soon. I manually reran the same commands this job runs on an identical virtual machine and was able to reproduce some substantial weirdness. I temporarily lost remote access to the VM around 108 minutes into running the job (~17:50 in the logs) and the out of band console also became unresponsive to carriage returns. The machine's IP address still responded to ICMP ping, but attempts to open new TCP sockets to the SSH service never got a protocol version banner back. After about 10 minutes of that I went out to lunch but left everything untouched. To my excitement it was up and responding again when I returned. It appears from the logs that it runs well past the 120-minute mark where devstack-gate tries to kill the gate hook for its configured timeout. Somewhere around 165 minutes in (18:47) you can see the kernel out-of-memory killer starts to kick in and kill httpd and mysqld processes according to the syslog. Hopefully this is enough additional detail to get you a start at finding the root cause so that we can reenable your job. Let me know if there's anything else you need for this.
[1] http://fungi.yuggoth.org/tmp/logs.tar -- Jeremy Stanley
Re: [openstack-dev] [devstack] [Cinder-GlusterFS CI] centos7 gate job abrupt failures
Hi Jeremy, Couldn't find anything strong in the logs to back the reason for OOM. At the time OOM happens, mysqld and java processes have the most RAM hence OOM selects mysqld (4.7G) to be killed. From a glusterfs backend perspective, I haven't found anything suspicious, and we don't have the logs of glusterfs (which are typically in /var/log/glusterfs) so can't delve inside glusterfs too much :( BharatK (in CC) also tried to re-create the issue in a local VM setup, but it hasn't reproduced yet! Having said that, we *do know* that we started seeing this issue after we enabled the nova-assisted-snapshot tests (by changing nova's policy.json to enable non-admin to create hyp-assisted snaps). We think that enabling online snaps might have added to the number of tests and memory load; that's the only clue we have as of now! So : 1) BharatK has merged the patch ( https://review.openstack.org/#/c/157707/ ) to revert the policy.json in the glusterfs job. So no more nova-assisted-snap tests. 2) We are also increasing the timeout of our job in patch ( https://review.openstack.org/#/c/157835/1 ) so that we can get a full run without timeouts to do a good analysis of the logs (logs are not posted if the job times out). Can you please re-enable our job, so that we can confirm that disabling the online snap TCs is helping the issue, which, if it does, will help us narrow down the cause. We also plan to monitor and debug over the weekend, hence having the job enabled will help us a lot. thanx, deepak On Thu, Feb 19, 2015 at 10:37 PM, Jeremy Stanley fu...@yuggoth.org wrote: On 2015-02-19 17:03:49 +0100 (+0100), Deepak Shetty wrote: [...] For some reason we are seeing the centos7 glusterfs CI job getting aborted/killed either by Java exception or the build getting aborted due to timeout. [...] Hoping to root cause this soon and get the cinder-glusterfs CI job back online soon.
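The OOM analysis above leans on spotting which process the kernel killed in the syslog. A small sketch for extracting the victims from syslog lines: it matches the standard kernel "Killed process <pid> (<name>)" message, and the sample line below is illustrative, not copied from the job logs:

```python
import re

# Standard kernel OOM-killer report: "Killed process <pid> (<name>) ..."
OOM_KILL = re.compile(r"Killed process (\d+) \(([^)]+)\)")

def oom_victims(syslog_lines):
    """Return (pid, name) for each process the OOM killer reports killing."""
    victims = []
    for line in syslog_lines:
        m = OOM_KILL.search(line)
        if m:
            victims.append((int(m.group(1)), m.group(2)))
    return victims

# Illustrative line, not taken from the actual job logs
sample = ["Feb 19 18:47:16 devstack kernel: Killed process 1234 (mysqld) total-vm:4915200kB"]
print(oom_victims(sample))  # -> [(1234, 'mysqld')]
```

Running this over the syslog in the tarballs would give a quick timeline of which services (mysqld, httpd, nova-compute) were shot in each run.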
I manually reran the same commands this job runs on an identical virtual machine and was able to reproduce some substantial weirdness. I temporarily lost remote access to the VM around 108 minutes into running the job (~17:50 in the logs) and the out of band console also became unresponsive to carriage returns. The machine's IP address still responded to ICMP ping, but attempts to open new TCP sockets to the SSH service never got a protocol version banner back. After about 10 minutes of that I went out to lunch but left everything untouched. To my excitement it was up and responding again when I returned. It appears from the logs that it runs well past the 120-minute mark where devstack-gate tries to kill the gate hook for its configured timeout. Somewhere around 165 minutes in (18:47) you can see the kernel out-of-memory killer starts to kick in and kill httpd and mysqld processes according to the syslog. Hopefully this is enough additional detail to get you a start at finding the root cause so that we can reenable your job. Let me know if there's anything else you need for this. [1] http://fungi.yuggoth.org/tmp/logs.tar -- Jeremy Stanley
Re: [openstack-dev] [devstack] [Cinder-GlusterFS CI] centos7 gate job abrupt failures
On Fri, Feb 20, 2015 at 7:29 AM, Deepak Shetty dpkshe...@gmail.com wrote: Hi Jeremy, Couldn't find anything strong in the logs to back the reason for OOM. At the time OOM happens, mysqld and java processes have the most RAM hence OOM selects mysqld (4.7G) to be killed. From a glusterfs backend perspective, I haven't found anything suspicious, and we don't have the logs of glusterfs (which are typically in /var/log/glusterfs) so can't delve inside glusterfs too much :( BharatK (in CC) also tried to re-create the issue in a local VM setup, but it hasn't reproduced yet! Having said that, we *do know* that we started seeing this issue after we enabled the nova-assisted-snapshot tests (by changing nova's policy.json to enable non-admin to create hyp-assisted snaps). We think that enabling online snaps might have added to the number of tests and memory load; that's the only clue we have as of now! It looks like OOM killer hit while qemu was busy and during a ServerRescueTest. Maybe libvirt logs would be useful as well? And I don't see any tempest tests calling assisted-volume-snapshots Also this looks odd: Feb 19 18:47:16 devstack-centos7-rax-iad-916633.slave.openstack.org libvirtd[3753]: missing __com.redhat_reason in disk io error event So : 1) BharatK has merged the patch ( https://review.openstack.org/#/c/157707/ ) to revert the policy.json in the glusterfs job. So no more nova-assisted-snap tests. 2) We are also increasing the timeout of our job in patch ( https://review.openstack.org/#/c/157835/1 ) so that we can get a full run without timeouts to do a good analysis of the logs (logs are not posted if the job times out). Can you please re-enable our job, so that we can confirm that disabling the online snap TCs is helping the issue, which, if it does, will help us narrow down the cause. We also plan to monitor and debug over the weekend, hence having the job enabled will help us a lot.
thanx, deepak On Thu, Feb 19, 2015 at 10:37 PM, Jeremy Stanley fu...@yuggoth.org wrote: On 2015-02-19 17:03:49 +0100 (+0100), Deepak Shetty wrote: [...] For some reason we are seeing the centos7 glusterfs CI job getting aborted/ killed either by Java exception or the build getting aborted due to timeout. [...] Hoping to root cause this soon and get the cinder-glusterfs CI job back online soon. I manually reran the same commands this job runs on an identical virtual machine and was able to reproduce some substantial weirdness. I temporarily lost remote access to the VM around 108 minutes into running the job (~17:50 in the logs) and the out of band console also became unresponsive to carriage returns. The machine's IP address still responded to ICMP ping, but attempts to open new TCP sockets to the SSH service never got a protocol version banner back. After about 10 minutes of that I went out to lunch but left everything untouched. To my excitement it was up and responding again when I returned. It appears from the logs that it runs well past the 120-minute mark where devstack-gate tries to kill the gate hook for its configured timeout. Somewhere around 165 minutes in (18:47) you can see the kernel out-of-memory killer starts to kick in and kill httpd and mysqld processes according to the syslog. Hopefully this is enough additional detail to get you a start at finding the root cause so that we can reenable your job. Let me know if there's anything else you need for this. 
[1] http://fungi.yuggoth.org/tmp/logs.tar -- Jeremy Stanley
Re: [openstack-dev] [devstack] [Cinder-GlusterFS CI] centos7 gate job abrupt failures
On 2015-02-20 16:29:31 +0100 (+0100), Deepak Shetty wrote: Couldn't find anything strong in the logs to back the reason for OOM. At the time OOM happens, mysqld and java processes have the most RAM hence OOM selects mysqld (4.7G) to be killed. [...] Today I reran it after you rolled back some additional tests, and it runs for about 117 minutes before the OOM killer shoots nova-compute in the head. At your request I've added /var/log/glusterfs into the tarball this time: http://fungi.yuggoth.org/tmp/logs2.tar -- Jeremy Stanley
Re: [openstack-dev] [devstack] [Cinder-GlusterFS CI] centos7 gate job abrupt failures
On 2015-02-19 17:03:49 +0100 (+0100), Deepak Shetty wrote: [...] For some reason we are seeing the centos7 glusterfs CI job getting aborted/ killed either by Java exception or the build getting aborted due to timeout. [...] Hoping to root cause this soon and get the cinder-glusterfs CI job back online soon. I manually reran the same commands this job runs on an identical virtual machine and was able to reproduce some substantial weirdness. I temporarily lost remote access to the VM around 108 minutes into running the job (~17:50 in the logs) and the out of band console also became unresponsive to carriage returns. The machine's IP address still responded to ICMP ping, but attempts to open new TCP sockets to the SSH service never got a protocol version banner back. After about 10 minutes of that I went out to lunch but left everything untouched. To my excitement it was up and responding again when I returned. It appears from the logs that it runs well past the 120-minute mark where devstack-gate tries to kill the gate hook for its configured timeout. Somewhere around 165 minutes in (18:47) you can see the kernel out-of-memory killer starts to kick in and kill httpd and mysqld processes according to the syslog. Hopefully this is enough additional detail to get you a start at finding the root cause so that we can reenable your job. Let me know if there's anything else you need for this. [1] http://fungi.yuggoth.org/tmp/logs.tar -- Jeremy Stanley
[openstack-dev] [devstack] [Cinder-GlusterFS CI] centos7 gate job abrupt failures
Hi clarkb, fungi, As discussed in http://eavesdrop.openstack.org/irclogs/%23openstack-infra/%23openstack-infra.2015-02-19.log ( 2015-02-19T14:51:46 onwards), I am starting this thread to track the abrupt job failures seen on the cinder-glusterfs CI job in the recent past. A small summary of the things that happened until now ... For some reason we are seeing the centos7 glusterfs CI job getting aborted/killed either by Java exception or the build getting aborted due to timeout. 1) https://jenkins07.openstack.org/job/check-tempest-dsvm-full-glusterfs-centos7/35/consoleFull - due to hudson Java exception 2) https://jenkins07.openstack.org/job/check-tempest-dsvm-full-glusterfs-centos7/34/consoleFull - due to build timeout For a list of all job failures, see https://jenkins07.openstack.org/job/check-tempest-dsvm-full-glusterfs-centos7/ Most of the failures are of type #1. As a result of which the cinder-glusterfs CI job was removed ... https://review.openstack.org/#/c/157213/ Per the discussion on IRC (see link above), fungi graciously agreed to debug this as it looks like it's happening on the 'rax' provider. Thanks fungi and clarkb :) Hoping to root cause this soon and get the cinder-glusterfs CI job back online soon. thanx, deepak
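Since the job history splits into the two failure modes above (hudson Java exception vs. build timeout), a rough way to tally them is to scan each build's console tail. This is a hypothetical triage helper; the marker strings are assumptions, not the exact Jenkins output:

```python
def classify_failure(console_tail):
    """Rough triage of a Jenkins console tail into the two failure modes
    seen on this job. Marker strings are assumptions, not exact Jenkins text."""
    text = " ".join(console_tail).lower()
    if "exception" in text and ("java" in text or "hudson" in text):
        return "hudson-java-exception"
    if "timed out" in text or "timeout" in text:
        return "build-timeout"
    return "unknown"

# Illustrative console tails, not copied from the actual builds
print(classify_failure(["hudson.remoting.RequestAbortedException: java.io.IOException"]))
# -> hudson-java-exception
print(classify_failure(["Build timed out (after 120 minutes). Marking the build as aborted."]))
# -> build-timeout
```

Run over the consoleFull pages of all builds, this would quantify "most of the failures are of type #1" instead of eyeballing it.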