Re: [openstack-dev] [cinder] [qa] which core team members are diving into - http://status.openstack.org/elastic-recheck/#1373513

2014-11-25 Thread John Griffith
On Tue, Nov 25, 2014 at 2:22 PM, Vishvananda Ishaya
 wrote:
>
> On Nov 25, 2014, at 7:29 AM, Matt Riedemann  
> wrote:
>
>>
>>
>> On 11/25/2014 9:03 AM, Matt Riedemann wrote:
>>>
>>>
>>> On 11/25/2014 8:11 AM, Sean Dague wrote:
 There is currently a review stream coming into Tempest to add Cinder v2
 tests in addition to the Cinder v1 tests. At the same time the currently
 biggest race fail in the gate related to the projects is
 http://status.openstack.org/elastic-recheck/#1373513 - which is cinder
 related.

 I believe these 2 facts are coupled. The number of volume tests we have
 in tempest is somewhat small, and as such the likelihood of them running
 simultaneously is also small. However the fact that as the # of tests
 with volumes goes up we are getting more of these race fails typically
 means that what's actually happening is 2 vol ops that aren't safe to
 run at the same time, are.

 This remains critical - https://bugs.launchpad.net/cinder/+bug/1373513 -
 with no assignee.

 So we really needs dedicated diving on this (last bug update with any
 code was a month ago), otherwise we need to stop adding these tests to
 Tempest, and honestly start skipping the volume tests if we can't have a
 repeatable success.

-Sean

>>>
>>> I just put up an e-r query for a newly opened bug
>>> https://bugs.launchpad.net/cinder/+bug/1396186 this morning, it looks
>>> similar to bug 1373513 but without the blocked task error in syslog.
>>>
>>> There is a three minute gap between when the volume is being deleted in
>>> c-vol logs and when we see the volume uuid logged again, at which point
>>> tempest has already timed out waiting for the delete to complete.
>>>
>>> We should at least get some patches to add diagnostic logging in these
>>> delete flows (or periodic tasks that use the same locks/low-level i/o
>>> bound commands?) to try and pinpoint these failures.
>>>
>>> I think I'm going to propose a skip patch for test_volume_boot_pattern
>>> since that just seems to be a never ending cause of pain until these
>>> root issues get fixed.
>>>
>>
>> I marked 1396186 as a duplicate of 1373513 since the e-r query for 1373513 
>> had an OR message which was the same as 1396186.
>>
>> I went ahead and proposed a skip for test_volume_boot_pattern due to bug 
>> 1373513 [1] until people get on top of debugging it.
>>
>> I added some notes to bug 1396186, the 3 minute hang seems to be due to a 
>> vgs call taking ~1 minute and an lvs call taking ~2 minutes.
>>
>> I'm not sure if those are hit in the volume delete flow or in some periodic 
>> task, but if there are multiple concurrent worker processes that could be 
>> hitting those commands at the same time can we look at off-loading one of 
>> them to a separate thread or something?
>
> Do we set up devstack to not zero volumes on delete 
> (CINDER_SECURE_DELETE=False) ? If not, the dd process could be hanging the 
> system due to io load. This would get significantly worse with multiple 
> deletes occurring simultaneously.
>
> Vish
>
>
> ___
> OpenStack-dev mailing list
> OpenStack-dev@lists.openstack.org
> http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
>

I'm trying to dig into this, so far I believe it might be related to
the new activate/deactivate semantics (new bug, similar to the old bug
Vish referenced re secure delete).  Also, we are setting
secure_delete=False still so that isn't the issue (although the
underlying root cause may be the same?).

There are two things that I'm seeing so far:
1. lvdisplay taking up to 5 minutes to respond (but it does respond)
2. lvremove on snapshots failing due to suspended state in dm mapper
and the activation command apparently failing (need to look at this
some more still)

I've been trying to duplicate issue 1 which shows up very frequently
in the gate, however haven't had much luck.  I've been focused on just
running volume tests however and am now adding in the full devstack
gate tests.  Suspect there might be something load related here.

Haven't duplicated issue 2 so far either, but have some ideas of
things that might help here with dmsetup.

Anyway, Eric looked at it for a bit before Paris as well; I'll chat
with him tomorrow and continue looking at it myself as well.

___
OpenStack-dev mailing list
OpenStack-dev@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [cinder] [qa] which core team members are diving into - http://status.openstack.org/elastic-recheck/#1373513

2014-11-25 Thread Matthew Treinish
On Tue, Nov 25, 2014 at 01:22:01PM -0800, Vishvananda Ishaya wrote:
> 
> On Nov 25, 2014, at 7:29 AM, Matt Riedemann  
> wrote:
> 
> > 
> > 
> > On 11/25/2014 9:03 AM, Matt Riedemann wrote:
> >> 
> >> 
> >> On 11/25/2014 8:11 AM, Sean Dague wrote:
> >>> There is currently a review stream coming into Tempest to add Cinder v2
> >>> tests in addition to the Cinder v1 tests. At the same time the currently
> >>> biggest race fail in the gate related to the projects is
> >>> http://status.openstack.org/elastic-recheck/#1373513 - which is cinder
> >>> related.
> >>> 
> >>> I believe these 2 facts are coupled. The number of volume tests we have
> >>> in tempest is somewhat small, and as such the likelihood of them running
> >>> simultaneously is also small. However the fact that as the # of tests
> >>> with volumes goes up we are getting more of these race fails typically
> >>> means that what's actually happening is 2 vol ops that aren't safe to
> >>> run at the same time, are.
> >>> 
> >>> This remains critical - https://bugs.launchpad.net/cinder/+bug/1373513 -
> >>> with no assignee.
> >>> 
> >>> So we really needs dedicated diving on this (last bug update with any
> >>> code was a month ago), otherwise we need to stop adding these tests to
> >>> Tempest, and honestly start skipping the volume tests if we can't have a
> >>> repeatable success.
> >>> 
> >>>-Sean
> >>> 
> >> 
> >> I just put up an e-r query for a newly opened bug
> >> https://bugs.launchpad.net/cinder/+bug/1396186 this morning, it looks
> >> similar to bug 1373513 but without the blocked task error in syslog.
> >> 
> >> There is a three minute gap between when the volume is being deleted in
> >> c-vol logs and when we see the volume uuid logged again, at which point
> >> tempest has already timed out waiting for the delete to complete.
> >> 
> >> We should at least get some patches to add diagnostic logging in these
> >> delete flows (or periodic tasks that use the same locks/low-level i/o
> >> bound commands?) to try and pinpoint these failures.
> >> 
> >> I think I'm going to propose a skip patch for test_volume_boot_pattern
> >> since that just seems to be a never ending cause of pain until these
> >> root issues get fixed.
> >> 
> > 
> > I marked 1396186 as a duplicate of 1373513 since the e-r query for 1373513 
> > had an OR message which was the same as 1396186.
> > 
> > I went ahead and proposed a skip for test_volume_boot_pattern due to bug 
> > 1373513 [1] until people get on top of debugging it.
> > 
> > I added some notes to bug 1396186, the 3 minute hang seems to be due to a 
> > vgs call taking ~1 minute and an lvs call taking ~2 minutes.
> > 
> > I'm not sure if those are hit in the volume delete flow or in some periodic 
> > task, but if there are multiple concurrent worker processes that could be 
> > hitting those commands at the same time can we look at off-loading one of 
> > them to a separate thread or something?
> 
> Do we set up devstack to not zero volumes on delete 
> (CINDER_SECURE_DELETE=False) ? If not, the dd process could be hanging the 
> system due to io load. This would get significantly worse with multiple 
> deletes occurring simultaneously.

Yes, we do that:

http://git.openstack.org/cgit/openstack-infra/devstack-gate/tree/devstack-vm-gate.sh#n139

and

http://git.openstack.org/cgit/openstack-infra/devstack-gate/tree/devstack-vm-gate-wrap.sh#n170

it can be overridden, but I don't think that any of the job definitions do that.

-Matt Treinish


pgpyqzwGQmI7O.pgp
Description: PGP signature
___
OpenStack-dev mailing list
OpenStack-dev@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [cinder] [qa] which core team members are diving into - http://status.openstack.org/elastic-recheck/#1373513

2014-11-25 Thread Vishvananda Ishaya

On Nov 25, 2014, at 7:29 AM, Matt Riedemann  wrote:

> 
> 
> On 11/25/2014 9:03 AM, Matt Riedemann wrote:
>> 
>> 
>> On 11/25/2014 8:11 AM, Sean Dague wrote:
>>> There is currently a review stream coming into Tempest to add Cinder v2
>>> tests in addition to the Cinder v1 tests. At the same time the currently
>>> biggest race fail in the gate related to the projects is
>>> http://status.openstack.org/elastic-recheck/#1373513 - which is cinder
>>> related.
>>> 
>>> I believe these 2 facts are coupled. The number of volume tests we have
>>> in tempest is somewhat small, and as such the likelihood of them running
>>> simultaneously is also small. However the fact that as the # of tests
>>> with volumes goes up we are getting more of these race fails typically
>>> means that what's actually happening is 2 vol ops that aren't safe to
>>> run at the same time, are.
>>> 
>>> This remains critical - https://bugs.launchpad.net/cinder/+bug/1373513 -
>>> with no assignee.
>>> 
>>> So we really needs dedicated diving on this (last bug update with any
>>> code was a month ago), otherwise we need to stop adding these tests to
>>> Tempest, and honestly start skipping the volume tests if we can't have a
>>> repeatable success.
>>> 
>>>-Sean
>>> 
>> 
>> I just put up an e-r query for a newly opened bug
>> https://bugs.launchpad.net/cinder/+bug/1396186 this morning, it looks
>> similar to bug 1373513 but without the blocked task error in syslog.
>> 
>> There is a three minute gap between when the volume is being deleted in
>> c-vol logs and when we see the volume uuid logged again, at which point
>> tempest has already timed out waiting for the delete to complete.
>> 
>> We should at least get some patches to add diagnostic logging in these
>> delete flows (or periodic tasks that use the same locks/low-level i/o
>> bound commands?) to try and pinpoint these failures.
>> 
>> I think I'm going to propose a skip patch for test_volume_boot_pattern
>> since that just seems to be a never ending cause of pain until these
>> root issues get fixed.
>> 
> 
> I marked 1396186 as a duplicate of 1373513 since the e-r query for 1373513 
> had an OR message which was the same as 1396186.
> 
> I went ahead and proposed a skip for test_volume_boot_pattern due to bug 
> 1373513 [1] until people get on top of debugging it.
> 
> I added some notes to bug 1396186, the 3 minute hang seems to be due to a vgs 
> call taking ~1 minute and an lvs call taking ~2 minutes.
> 
> I'm not sure if those are hit in the volume delete flow or in some periodic 
> task, but if there are multiple concurrent worker processes that could be 
> hitting those commands at the same time can we look at off-loading one of 
> them to a separate thread or something?

Do we set up devstack to not zero volumes on delete 
(CINDER_SECURE_DELETE=False) ? If not, the dd process could be hanging the 
system due to io load. This would get significantly worse with multiple deletes 
occurring simultaneously.

Vish



signature.asc
Description: Message signed with OpenPGP using GPGMail
___
OpenStack-dev mailing list
OpenStack-dev@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [cinder] [qa] which core team members are diving into - http://status.openstack.org/elastic-recheck/#1373513

2014-11-25 Thread Matt Riedemann



On 11/25/2014 9:03 AM, Matt Riedemann wrote:



On 11/25/2014 8:11 AM, Sean Dague wrote:

There is currently a review stream coming into Tempest to add Cinder v2
tests in addition to the Cinder v1 tests. At the same time the currently
biggest race fail in the gate related to the projects is
http://status.openstack.org/elastic-recheck/#1373513 - which is cinder
related.

I believe these 2 facts are coupled. The number of volume tests we have
in tempest is somewhat small, and as such the likelihood of them running
simultaneously is also small. However the fact that as the # of tests
with volumes goes up we are getting more of these race fails typically
means that what's actually happening is 2 vol ops that aren't safe to
run at the same time, are.

This remains critical - https://bugs.launchpad.net/cinder/+bug/1373513 -
with no assignee.

So we really needs dedicated diving on this (last bug update with any
code was a month ago), otherwise we need to stop adding these tests to
Tempest, and honestly start skipping the volume tests if we can't have a
repeatable success.

-Sean



I just put up an e-r query for a newly opened bug
https://bugs.launchpad.net/cinder/+bug/1396186 this morning, it looks
similar to bug 1373513 but without the blocked task error in syslog.

There is a three minute gap between when the volume is being deleted in
c-vol logs and when we see the volume uuid logged again, at which point
tempest has already timed out waiting for the delete to complete.

We should at least get some patches to add diagnostic logging in these
delete flows (or periodic tasks that use the same locks/low-level i/o
bound commands?) to try and pinpoint these failures.

I think I'm going to propose a skip patch for test_volume_boot_pattern
since that just seems to be a never ending cause of pain until these
root issues get fixed.



I marked 1396186 as a duplicate of 1373513 since the e-r query for 
1373513 had an OR message which was the same as 1396186.


I went ahead and proposed a skip for test_volume_boot_pattern due to bug 
1373513 [1] until people get on top of debugging it.


I added some notes to bug 1396186, the 3 minute hang seems to be due to 
a vgs call taking ~1 minute and an lvs call taking ~2 minutes.


I'm not sure if those are hit in the volume delete flow or in some 
periodic task, but if there are multiple concurrent worker processes 
that could be hitting those commands at the same time can we look at 
off-loading one of them to a separate thread or something?


[1] https://review.openstack.org/#/c/137096/

--

Thanks,

Matt Riedemann


___
OpenStack-dev mailing list
OpenStack-dev@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [cinder] [qa] which core team members are diving into - http://status.openstack.org/elastic-recheck/#1373513

2014-11-25 Thread Matt Riedemann



On 11/25/2014 8:11 AM, Sean Dague wrote:

There is currently a review stream coming into Tempest to add Cinder v2
tests in addition to the Cinder v1 tests. At the same time the currently
biggest race fail in the gate related to the projects is
http://status.openstack.org/elastic-recheck/#1373513 - which is cinder
related.

I believe these 2 facts are coupled. The number of volume tests we have
in tempest is somewhat small, and as such the likelihood of them running
simultaneously is also small. However the fact that as the # of tests
with volumes goes up we are getting more of these race fails typically
means that what's actually happening is 2 vol ops that aren't safe to
run at the same time, are.

This remains critical - https://bugs.launchpad.net/cinder/+bug/1373513 -
with no assignee.

So we really needs dedicated diving on this (last bug update with any
code was a month ago), otherwise we need to stop adding these tests to
Tempest, and honestly start skipping the volume tests if we can't have a
repeatable success.

-Sean



I just put up an e-r query for a newly opened bug 
https://bugs.launchpad.net/cinder/+bug/1396186 this morning, it looks 
similar to bug 1373513 but without the blocked task error in syslog.


There is a three minute gap between when the volume is being deleted in 
c-vol logs and when we see the volume uuid logged again, at which point 
tempest has already timed out waiting for the delete to complete.


We should at least get some patches to add diagnostic logging in these 
delete flows (or periodic tasks that use the same locks/low-level i/o 
bound commands?) to try and pinpoint these failures.


I think I'm going to propose a skip patch for test_volume_boot_pattern 
since that just seems to be a never ending cause of pain until these 
root issues get fixed.


--

Thanks,

Matt Riedemann


___
OpenStack-dev mailing list
OpenStack-dev@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


[openstack-dev] [cinder] [qa] which core team members are diving into - http://status.openstack.org/elastic-recheck/#1373513

2014-11-25 Thread Sean Dague
There is currently a review stream coming into Tempest to add Cinder v2
tests in addition to the Cinder v1 tests. At the same time the currently
biggest race fail in the gate related to the projects is
http://status.openstack.org/elastic-recheck/#1373513 - which is cinder
related.

I believe these 2 facts are coupled. The number of volume tests we have
in tempest is somewhat small, and as such the likelihood of them running
simultaneously is also small. However the fact that as the # of tests
with volumes goes up we are getting more of these race fails typically
means that what's actually happening is 2 vol ops that aren't safe to
run at the same time, are.

This remains critical - https://bugs.launchpad.net/cinder/+bug/1373513 -
with no assignee.

So we really needs dedicated diving on this (last bug update with any
code was a month ago), otherwise we need to stop adding these tests to
Tempest, and honestly start skipping the volume tests if we can't have a
repeatable success.

-Sean

-- 
Sean Dague
http://dague.net

___
OpenStack-dev mailing list
OpenStack-dev@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev