Re: [openstack-dev] [all][qa][glance] some recent tempest problems

2017-06-27 Thread Ghanshyam Mann
On Mon, Jun 26, 2017 at 11:58 PM, Eric Harney  wrote:
> On 06/19/2017 09:22 AM, Matt Riedemann wrote:
>> On 6/16/2017 8:58 AM, Eric Harney wrote:
>>> I'm not convinced yet that this failure is purely Ceph-specific, at a
>>> quick look.
>>>
>>> I think what happens here is that unshelve performs an asynchronous delete
>>> of a glance image, and returns as successful before the delete has
>>> necessarily completed.  The check in tempest then sees that the image
>>> still exists, and fails -- but this isn't valid, because the unshelve
>>> API doesn't guarantee that this image is no longer there at the time it
>>> returns.  This would fail on any image delete that isn't instantaneous.
>>>
>>> Is there a guarantee anywhere that the unshelve API behaves how this
>>> tempest test expects it to?
>>
>> There are no guarantees, no. The unshelve API reference is here [1]. The
>> asynchronous postconditions section just says:
>>
>> "After you successfully shelve a server, its status changes to ACTIVE.
>> The server appears on the compute node.
>>
>> The shelved image is deleted from the list of images returned by an API
>> call."
>>
>> It doesn't say the image is deleted immediately, or that it waits for
>> the image to be gone before changing the instance status to ACTIVE.
>>
>> I see there is also a typo in there; it should say after you
>> successfully *unshelve* a server.
>>
>> From an API user point of view, this is all asynchronous because it's an
>> RPC cast from the nova-api service to the nova-conductor and finally
>> nova-compute service when unshelving the instance.
>>
>> So I think the test is making some wrong assumptions on how fast the
>> image is going to be deleted when the instance is active.
>>
>> As Ken'ichi pointed out in the Tempest change, Glance returns a 204 when
>> deleting an image in the v2 API [2]. If the image delete is asynchronous
>> then that should probably be a 202.
>>
>> Either way the Tempest test should probably be in a wait loop for the
>> image to be gone if it's really going to assert this.
>>
>
> Thanks for confirming this.
>
> What do we need to do to get this fixed in Tempest?  Nobody from Tempest
> Core has responded to the revert patch [3] since this explanation was
> posted.
>
> IMO we should revert this for now and someone can implement a fixed
> version if this test is needed.

Sorry for the delay. Let's fix this instead of reverting it -
https://review.openstack.org/#/c/477821/

-gmann

>
> [3] https://review.openstack.org/#/c/471352/
>
>> [1]
>> https://developer.openstack.org/api-ref/compute/?expanded=unshelve-restore-shelved-server-unshelve-action-detail#unshelve-restore-shelved-server-unshelve-action
>>
>> [2]
>> https://developer.openstack.org/api-ref/image/v2/index.html?expanded=delete-an-image-detail#delete-an-image
>>
>>
>
>


Re: [openstack-dev] [all][qa][glance] some recent tempest problems

2017-06-26 Thread Eric Harney
On 06/19/2017 09:22 AM, Matt Riedemann wrote:
> On 6/16/2017 8:58 AM, Eric Harney wrote:
>> I'm not convinced yet that this failure is purely Ceph-specific, at a
>> quick look.
>>
>> I think what happens here is that unshelve performs an asynchronous delete
>> of a glance image, and returns as successful before the delete has
>> necessarily completed.  The check in tempest then sees that the image
>> still exists, and fails -- but this isn't valid, because the unshelve
>> API doesn't guarantee that this image is no longer there at the time it
>> returns.  This would fail on any image delete that isn't instantaneous.
>>
>> Is there a guarantee anywhere that the unshelve API behaves how this
>> tempest test expects it to?
> 
> There are no guarantees, no. The unshelve API reference is here [1]. The
> asynchronous postconditions section just says:
> 
> "After you successfully shelve a server, its status changes to ACTIVE.
> The server appears on the compute node.
> 
> The shelved image is deleted from the list of images returned by an API
> call."
> 
> It doesn't say the image is deleted immediately, or that it waits for
> the image to be gone before changing the instance status to ACTIVE.
> 
> I see there is also a typo in there; it should say after you
> successfully *unshelve* a server.
> 
> From an API user point of view, this is all asynchronous because it's an
> RPC cast from the nova-api service to the nova-conductor and finally
> nova-compute service when unshelving the instance.
> 
> So I think the test is making some wrong assumptions on how fast the
> image is going to be deleted when the instance is active.
> 
> As Ken'ichi pointed out in the Tempest change, Glance returns a 204 when
> deleting an image in the v2 API [2]. If the image delete is asynchronous
> then that should probably be a 202.
> 
> Either way the Tempest test should probably be in a wait loop for the
> image to be gone if it's really going to assert this.
> 

Thanks for confirming this.

What do we need to do to get this fixed in Tempest?  Nobody from Tempest
Core has responded to the revert patch [3] since this explanation was
posted.

IMO we should revert this for now and someone can implement a fixed
version if this test is needed.

[3] https://review.openstack.org/#/c/471352/

> [1]
> https://developer.openstack.org/api-ref/compute/?expanded=unshelve-restore-shelved-server-unshelve-action-detail#unshelve-restore-shelved-server-unshelve-action
> 
> [2]
> https://developer.openstack.org/api-ref/image/v2/index.html?expanded=delete-an-image-detail#delete-an-image
> 
> 




Re: [openstack-dev] [all][qa][glance] some recent tempest problems

2017-06-19 Thread Matt Riedemann

On 6/16/2017 8:58 AM, Eric Harney wrote:

I'm not convinced yet that this failure is purely Ceph-specific, at a
quick look.

I think what happens here is that unshelve performs an asynchronous delete
of a glance image, and returns as successful before the delete has
necessarily completed.  The check in tempest then sees that the image
still exists, and fails -- but this isn't valid, because the unshelve
API doesn't guarantee that this image is no longer there at the time it
returns.  This would fail on any image delete that isn't instantaneous.

Is there a guarantee anywhere that the unshelve API behaves how this
tempest test expects it to?


There are no guarantees, no. The unshelve API reference is here [1]. The 
asynchronous postconditions section just says:


"After you successfully shelve a server, its status changes to ACTIVE. 
The server appears on the compute node.


The shelved image is deleted from the list of images returned by an API 
call."


It doesn't say the image is deleted immediately, or that it waits for 
the image to be gone before changing the instance status to ACTIVE.


I see there is also a typo in there; it should say after you
successfully *unshelve* a server.


From an API user point of view, this is all asynchronous because it's 
an RPC cast from the nova-api service to the nova-conductor and finally 
nova-compute service when unshelving the instance.


So I think the test is making some wrong assumptions on how fast the 
image is going to be deleted when the instance is active.


As Ken'ichi pointed out in the Tempest change, Glance returns a 204 when 
deleting an image in the v2 API [2]. If the image delete is asynchronous 
then that should probably be a 202.


Either way the Tempest test should probably be in a wait loop for the 
image to be gone if it's really going to assert this.
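
A minimal sketch of the kind of wait loop being suggested here, assuming a
Tempest-style images client (the helper name, the timeout values, and the
use of tempest.lib exceptions are illustrative; this is not the actual fix
that was eventually proposed):

    import time

    from tempest.lib import exceptions as lib_exc

    def wait_for_image_deleted(images_client, image_id,
                               timeout=300, interval=5):
        """Poll Glance until the image is gone, instead of asserting once.

        The unshelve API only promises that the shelved image will
        eventually disappear, so the check has to tolerate an
        asynchronous delete.
        """
        start = time.time()
        while time.time() - start < timeout:
            try:
                images_client.show_image(image_id)
            except lib_exc.NotFound:
                return  # image is gone; the postcondition holds
            time.sleep(interval)
        raise AssertionError('Image %s still exists after %s seconds'
                             % (image_id, timeout))

This mirrors the poll-until-done pattern the helpers in
tempest.common.waiters already use elsewhere.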


[1] 
https://developer.openstack.org/api-ref/compute/?expanded=unshelve-restore-shelved-server-unshelve-action-detail#unshelve-restore-shelved-server-unshelve-action
[2] 
https://developer.openstack.org/api-ref/image/v2/index.html?expanded=delete-an-image-detail#delete-an-image


--

Thanks,

Matt



Re: [openstack-dev] [all][qa][glance] some recent tempest problems

2017-06-16 Thread Matt Riedemann

On 6/16/2017 9:46 AM, Eric Harney wrote:

On 06/16/2017 10:21 AM, Sean McGinnis wrote:


I don't think merging tests that are showing failures, then blacklisting
them, is the right approach. And as Eric points out, this isn't
necessarily just a failure with Ceph. There is a legitimate logical
issue with what this particular test is doing.

But in general, to get back to some of the earlier points, I don't think
we should be merging tests with known breakages until those breakages
can be first addressed.



As another example, this was the last round of this, in May:

https://review.openstack.org/#/c/332670/

which is a new tempest test for a Cinder API that is not supported by
all drivers.  The Ceph job correctly failed on the tempest patch, the
test was merged anyway, and then the Ceph jobs broke:

https://bugs.launchpad.net/glance/+bug/1687538
https://review.openstack.org/#/c/461625/

This is really not a sustainable model.

And this is the _easy_ case, since Ceph jobs run in OpenStack infra and
are easily visible and trackable.  I'm not sure what the impact is on
Cinder third-party CI for other drivers.




This is generally why we have config options in Tempest to not run tests 
that certain backends don't implement, like all of the backup/snapshot 
volume tests that the NFS job was failing on forever.


I think it's perfectly valid to have tests in Tempest for things that 
not all backends implement as long as they are configurable. It's up to 
the various CI jobs to configure Tempest properly for what they support 
and then work on reducing the number of things they don't support. We've 
been doing that for ages now.
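
For reference, a job whose backend doesn't support a feature carries
something like the following in its tempest.conf; the options shown are
from the [volume-feature-enabled] group, and which flags a given job sets
is up to that job:

    [volume-feature-enabled]
    # The backend under test does not implement volume backups, so the
    # backup tests are skipped rather than run and failed.
    backup = False
    # Snapshots are supported, so those tests still run.
    snapshot = True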


--

Thanks,

Matt



Re: [openstack-dev] [all][qa][glance] some recent tempest problems

2017-06-16 Thread Matt Riedemann

On 6/16/2017 8:13 PM, Matt Riedemann wrote:
Yeah there is a distinction between the ceph nv job that runs on 
nova/cinder/glance changes and the ceph job that runs on os-brick and 
glance_store changes. When we made the tempest dsvm ceph job non-voting 
we failed to mirror that in the os-brick/glance-store jobs. We should do 
that.


Here you go:

https://review.openstack.org/#/c/475095/

--

Thanks,

Matt



Re: [openstack-dev] [all][qa][glance] some recent tempest problems

2017-06-16 Thread Matt Riedemann

On 6/16/2017 3:32 PM, Sean McGinnis wrote:


So, before we go further, ceph seems to be -nv on all projects right
now, right? So I get there is some debate on that patch, but is it
blocking anything?



Ceph is voting on os-brick patches. So it does block some things when
we run into this situation.

But again, we should avoid getting into this situation in the first
place, voting or no.





Yeah there is a distinction between the ceph nv job that runs on 
nova/cinder/glance changes and the ceph job that runs on os-brick and 
glance_store changes. When we made the tempest dsvm ceph job non-voting 
we failed to mirror that in the os-brick/glance-store jobs. We should do 
that.


--

Thanks,

Matt



Re: [openstack-dev] [all][qa][glance] some recent tempest problems

2017-06-16 Thread Sean McGinnis
> 
> So, before we go further, ceph seems to be -nv on all projects right
> now, right? So I get there is some debate on that patch, but is it
> blocking anything?
> 

Ceph is voting on os-brick patches. So it does block some things when
we run into this situation.

But again, we should avoid getting into this situation in the first
place, voting or no.




Re: [openstack-dev] [all][qa][glance] some recent tempest problems

2017-06-16 Thread Sean Dague
On 06/16/2017 10:46 AM, Eric Harney wrote:
> On 06/16/2017 10:21 AM, Sean McGinnis wrote:
>>
>> I don't think merging tests that are showing failures, then blacklisting
>> them, is the right approach. And as Eric points out, this isn't
>> necessarily just a failure with Ceph. There is a legitimate logical
>> issue with what this particular test is doing.
>>
>> But in general, to get back to some of the earlier points, I don't think
>> we should be merging tests with known breakages until those breakages
>> can be first addressed.
>>
> 
> As another example, this was the last round of this, in May:
> 
> https://review.openstack.org/#/c/332670/
> 
> which is a new tempest test for a Cinder API that is not supported by
> all drivers.  The Ceph job correctly failed on the tempest patch, the
> test was merged anyway, and then the Ceph jobs broke:
> 
> https://bugs.launchpad.net/glance/+bug/1687538
> https://review.openstack.org/#/c/461625/
> 
> This is really not a sustainable model.
> 
> And this is the _easy_ case, since Ceph jobs run in OpenStack infra and
> are easily visible and trackable.  I'm not sure what the impact is on
> Cinder third-party CI for other drivers.

Ah, so the issue is that
gate-tempest-dsvm-full-ceph-plugin-src-glance_store-ubuntu-xenial is
voting, because when the regex was made to stop ceph jobs from voting
(which they aren't on Nova, Tempest, Glance, or Cinder), it wasn't
applied there.

It's also a question of why a library is doing different backend testing
through full-stack testing instead of more targeted and controlled tests,
which I think is probably also less than ideal.

Both would be good things to fix.
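
For the first of those, the fix is the kind of Zuul v2 layout tweak
sketched below, following the project-config conventions of the time (the
job-name regex is an assumption, not the actual change):

    # zuul/layout.yaml: make the ceph-based tempest jobs non-voting on the
    # library gates too, matching what nova/cinder/glance already do.
    jobs:
      - name: ^gate-tempest-dsvm-full-ceph-plugin-src-.*$
        voting: false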

-Sean

-- 
Sean Dague
http://dague.net



Re: [openstack-dev] [all][qa][glance] some recent tempest problems

2017-06-16 Thread Sean Dague
On 06/16/2017 09:51 AM, Sean McGinnis wrote:
>>
>> It would be useful to provide detailed examples. Everything is trade
>> offs, and having the conversation in the abstract is very difficult to
>> understand those trade offs.
>>
>>  -Sean
>>
> 
> We've had this issue in Cinder and os-brick. Usually around Ceph, but if
> you follow the user survey, that's the most popular backend.
> 
> The problem we see is the tempest test that covers this is non-voting.
> And there have been several cases so far where this non-voting job does
> not pass, due to a legitimate failure, but the tempest patch merges anyway.
> 
> 
> To be fair, these failures usually do point out actual problems that need
> to be fixed. Not always, but at least in a few cases. But instead of it
> being addressed first to make sure there is no disruption, it's suddenly
> a blocking issue that holds up everything until it's either reverted, skipped,
> or the problem is resolved.
> 
> Here's one recent instance: https://review.openstack.org/#/c/471352/

So, before we go further, ceph seems to be -nv on all projects right
now, right? So I get there is some debate on that patch, but is it
blocking anything?

Again, we seem to be missing specifics and a sequence of events here;
lacking that, everyone is trying to guess what the problems are, which I
don't think is effective.

-Sean

-- 
Sean Dague
http://dague.net



Re: [openstack-dev] [all][qa][glance] some recent tempest problems

2017-06-16 Thread Eric Harney
On 06/16/2017 10:21 AM, Sean McGinnis wrote:
> 
> I don't think merging tests that are showing failures, then blacklisting
> them, is the right approach. And as Eric points out, this isn't
> necessarily just a failure with Ceph. There is a legitimate logical
> issue with what this particular test is doing.
> 
> But in general, to get back to some of the earlier points, I don't think
> we should be merging tests with known breakages until those breakages
> can be first addressed.
> 

As another example, this was the last round of this, in May:

https://review.openstack.org/#/c/332670/

which is a new tempest test for a Cinder API that is not supported by
all drivers.  The Ceph job correctly failed on the tempest patch, the
test was merged anyway, and then the Ceph jobs broke:

https://bugs.launchpad.net/glance/+bug/1687538
https://review.openstack.org/#/c/461625/

This is really not a sustainable model.

And this is the _easy_ case, since Ceph jobs run in OpenStack infra and
are easily visible and trackable.  I'm not sure what the impact is on
Cinder third-party CI for other drivers.



Re: [openstack-dev] [all][qa][glance] some recent tempest problems

2017-06-16 Thread Sean McGinnis
> 
> Yeah, we had such cases and decided to have a blacklist of tests not
> suitable for Ceph; the Ceph job will exclude the tests that fail on Ceph.
> Jon is working on this - https://review.openstack.org/#/c/459774/
> 

I don't think merging tests that are showing failures, then blacklisting
them, is the right approach. And as Eric points out, this isn't
necessarily just a failure with Ceph. There is a legitimate logical
issue with what this particular test is doing.

But in general, to get back to some of the earlier points, I don't think
we should be merging tests with known breakages until those breakages
can be first addressed.



Re: [openstack-dev] [all][qa][glance] some recent tempest problems

2017-06-16 Thread Doug Hellmann
Excerpts from Ghanshyam Mann's message of 2017-06-16 23:05:08 +0900:
> On Fri, Jun 16, 2017 at 10:57 PM, Sean Dague  wrote:
> > On 06/16/2017 09:51 AM, Sean McGinnis wrote:
> >>>
> >>> It would be useful to provide detailed examples. Everything is trade
> >>> offs, and having the conversation in the abstract is very difficult to
> >>> understand those trade offs.
> >>>
> >>>  -Sean
> >>>
> >>
> >> We've had this issue in Cinder and os-brick. Usually around Ceph, but if
> >> you follow the user survey, that's the most popular backend.
> >>
> >> The problem we see is the tempest test that covers this is non-voting.
> >> And there have been several cases so far where this non-voting job does
> >> not pass, due to a legitimate failure, but the tempest patch merges anyway.
> >>
> >>
> >> To be fair, these failures usually do point out actual problems that need
> >> to be fixed. Not always, but at least in a few cases. But instead of it
> >> being addressed first to make sure there is no disruption, it's suddenly
> >> a blocking issue that holds up everything until it's either reverted, 
> >> skipped,
> >> or the problem is resolved.
> >>
> >> Here's one recent instance: https://review.openstack.org/#/c/471352/
> >
> > Sure, if ceph is the primary concern, that feels like it should be a
> > reasonable specific thing to fix. It's not a grand issue, it's a
> > specific mismatch on what configs should be common.
> 
> Yeah, we had such cases and decided to have a blacklist of tests not
> suitable for Ceph; the Ceph job will exclude the tests that fail on Ceph.
> Jon is working on this - https://review.openstack.org/#/c/459774/
> 
> This approach solves the problem without limiting the test scope. [1]
> 
> ..1 http://lists.openstack.org/pipermail/openstack-dev/2017-May/116172.html
> 
> -gmann

Is Ceph behaving in an unexpected way, or are the tests making implicit
assumptions that might also cause trouble for other backends if these
tests ever make it into the suite used by the interop team?

Doug



Re: [openstack-dev] [all][qa][glance] some recent tempest problems

2017-06-16 Thread Ghanshyam Mann
On Fri, Jun 16, 2017 at 10:57 PM, Sean Dague  wrote:
> On 06/16/2017 09:51 AM, Sean McGinnis wrote:
>>>
>>> It would be useful to provide detailed examples. Everything is trade
>>> offs, and having the conversation in the abstract is very difficult to
>>> understand those trade offs.
>>>
>>>  -Sean
>>>
>>
>> We've had this issue in Cinder and os-brick. Usually around Ceph, but if
>> you follow the user survey, that's the most popular backend.
>>
>> The problem we see is the tempest test that covers this is non-voting.
>> And there have been several cases so far where this non-voting job does
>> not pass, due to a legitimate failure, but the tempest patch merges anyway.
>>
>>
>> To be fair, these failures usually do point out actual problems that need
>> to be fixed. Not always, but at least in a few cases. But instead of it
>> being addressed first to make sure there is no disruption, it's suddenly
>> a blocking issue that holds up everything until it's either reverted, 
>> skipped,
>> or the problem is resolved.
>>
>> Here's one recent instance: https://review.openstack.org/#/c/471352/
>
> Sure, if ceph is the primary concern, that feels like it should be a
> reasonable specific thing to fix. It's not a grand issue, it's a
> specific mismatch on what configs should be common.

Yeah, we had such cases and decided to have a blacklist of tests not
suitable for Ceph; the Ceph job will exclude the tests that fail on Ceph.
Jon is working on this - https://review.openstack.org/#/c/459774/

This approach solves the problem without limiting the test scope. [1]

..1 http://lists.openstack.org/pipermail/openstack-dev/2017-May/116172.html

-gmann

>
> -Sean
>
> --
> Sean Dague
> http://dague.net
>


Re: [openstack-dev] [all][qa][glance] some recent tempest problems

2017-06-16 Thread Eric Harney
On 06/15/2017 10:51 PM, Ghanshyam Mann wrote:
> On Fri, Jun 16, 2017 at 9:43 AM,  <zhu.fang...@zte.com.cn> wrote:
>> https://review.openstack.org/#/c/471352/   may be an example
> 
> If this is the Ceph-related case, I think we already discussed this kind
> of case, where functionality depends on the backend storage, and how to
> handle the corresponding test failures [1].
> 
> The solution there was that the Ceph job should exclude, by regex, test
> cases whose functionality is not implemented/supported in Ceph. Jon
> Bernard is working on this test blacklist [2].
> 
> If there is any other job or case, then we can discuss/think about having
> the job run on the Tempest gate as well, which I think we do in most cases.
> 
> And about making the Ceph job voting: I remember we did not do that due to
> the stability of the job. The Ceph job fails frequently; once Jon's patches
> merge and the job is consistently stable, we can make it voting.
> 

I'm not convinced yet that this failure is purely Ceph-specific, at a
quick look.

I think what happens here is that unshelve performs an asynchronous delete
of a glance image, and returns as successful before the delete has
necessarily completed.  The check in tempest then sees that the image
still exists, and fails -- but this isn't valid, because the unshelve
API doesn't guarantee that this image is no longer there at the time it
returns.  This would fail on any image delete that isn't instantaneous.

Is there a guarantee anywhere that the unshelve API behaves how this
tempest test expects it to?

>>
>>
>> Original Mail
>> Sender:  <s...@dague.net>;
>> To:  <openstack-dev@lists.openstack.org>;
>> Date: 2017/06/16 05:25
>> Subject: Re: [openstack-dev] [all][qa][glance] some recent tempest problems
>>
>>
>> On 06/15/2017 01:04 PM, Brian Rosmaita wrote:
>>> This isn't a glance-specific problem though we've encountered it quite
>>> a few times recently.
>>>
>>> Briefly, we're gating on Tempest jobs that tempest itself does not
>>> gate on.  This leads to a situation where new tests can be merged in
>>> tempest, but wind up breaking our gate. We aren't claiming that the
>>> added tests are bad or don't provide value; the problem is that we
>>> have to drop everything and fix the gate.  This interrupts our current
>>> work and forces us to prioritize bugs to fix based not on what makes
>>> the most sense for the project given current priorities and resources,
>>> but based on whatever we can do to get the gates un-blocked.
>>>
>>> As we said earlier, this situation seems to be impacting multiple
>>> projects.
>>>
>>> One solution for this is to change our gating so that we do not run
>>> any Tempest jobs against Glance repositories that are not also gated
>>> by Tempest.  That would in theory open a regression path, which is why
>>> we haven't put up a patch yet.  Another way this could be addressed is
>>> by the Tempest team changing the non-voting jobs causing this
>>> situation into voting jobs, which would prevent such changes from
>>> being merged in the first place.  The key issue here is that we need
>>> to be able to prioritize bugs based on what's most important to each
>>> project.
>>>
>>> We want to be clear that we appreciate the work the Tempest team does.
>>> We abhor bugs and want to squash them too.  The problem is just that
>>> we're stretched pretty thin with resources right now, and being forced
>>> to prioritize bug fixes that will get our gate un-blocked is
>>> interfering with our ability to work on issues that may have a higher
>>> impact on end users.
>>>
>>> The point of this email is to find out whether anyone has a better
>>> suggestion for how to handle this situation.
>>
>> It would be useful to provide detailed examples. Everything is trade
>> offs, and having the conversation in the abstract is very difficult to
>> understand those trade offs.
>>
>> -Sean
>>
>> --
>> Sean Dague
>> http://dague.net
>>
> 
> 
> ..1 http://lists.openstack.org/pipermail/openstack-dev/2017-May/116172.html
> 
> ..2 https://review.openstack.org/#/c/459774/ ,
> https://review.openstack.org/#/c/459445/
> 
> 
> -gmann
> 


Re: [openstack-dev] [all][qa][glance] some recent tempest problems

2017-06-16 Thread Sean Dague
On 06/16/2017 09:51 AM, Sean McGinnis wrote:
>>
>> It would be useful to provide detailed examples. Everything is trade
>> offs, and having the conversation in the abstract is very difficult to
>> understand those trade offs.
>>
>>  -Sean
>>
> 
> We've had this issue in Cinder and os-brick. Usually around Ceph, but if
> you follow the user survey, that's the most popular backend.
> 
> The problem we see is the tempest test that covers this is non-voting.
> And there have been several cases so far where this non-voting job does
> not pass, due to a legitimate failure, but the tempest patch merges anyway.
> 
> 
> To be fair, these failures usually do point out actual problems that need
> to be fixed. Not always, but at least in a few cases. But instead of it
> being addressed first to make sure there is no disruption, it's suddenly
> a blocking issue that holds up everything until it's either reverted, skipped,
> or the problem is resolved.
> 
> Here's one recent instance: https://review.openstack.org/#/c/471352/

Sure, if ceph is the primary concern, that feels like it should be a
reasonable specific thing to fix. It's not a grand issue, it's a
specific mismatch on what configs should be common.

-Sean

-- 
Sean Dague
http://dague.net



Re: [openstack-dev] [all][qa][glance] some recent tempest problems

2017-06-16 Thread Sean McGinnis
> 
> It would be useful to provide detailed examples. Everything is trade
> offs, and having the conversation in the abstract is very difficult to
> understand those trade offs.
> 
>   -Sean
> 

We've had this issue in Cinder and os-brick. Usually around Ceph, but if
you follow the user survey, that's the most popular backend.

The problem we see is the tempest test that covers this is non-voting.
And there have been several cases so far where this non-voting job does
not pass, due to a legitimate failure, but the tempest patch merges anyway.


To be fair, these failures usually do point out actual problems that need
to be fixed. Not always, but at least in a few cases. But instead of it
being addressed first to make sure there is no disruption, it's suddenly
a blocking issue that holds up everything until it's either reverted, skipped,
or the problem is resolved.

Here's one recent instance: https://review.openstack.org/#/c/471352/

Sean



Re: [openstack-dev] [all][qa][glance] some recent tempest problems

2017-06-15 Thread Ghanshyam Mann
On Fri, Jun 16, 2017 at 9:43 AM,  <zhu.fang...@zte.com.cn> wrote:
> https://review.openstack.org/#/c/471352/   may be an example

If this is the Ceph-related case, I think we already discussed this kind
of case, where functionality depends on the backend storage, and how to
handle the corresponding test failures [1].

The solution there was that the Ceph job should exclude, by regex, test
cases whose functionality is not implemented/supported in Ceph. Jon
Bernard is working on this test blacklist [2].

If there is any other job or case, then we can discuss/think about having
the job run on the Tempest gate as well, which I think we do in most cases.

And about making the Ceph job voting: I remember we did not do that due to
the stability of the job. The Ceph job fails frequently; once Jon's patches
merge and the job is consistently stable, we can make it voting.
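
For context, such a blacklist is just a file of test-name regexes, one per
line, that the Ceph job feeds to the test runner so matching tests are
skipped; the entries below are placeholders, not the contents of Jon's
patch:

    # Tests exercising functionality this backend does not implement;
    # the ceph job skips anything matching these regexes.
    ^tempest\.api\.volume\.admin\.test_some_unsupported_feature.*
    ^tempest\.api\.volume\.test_another_backend_specific_case.*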

>
>
> Original Mail
> Sender:  <s...@dague.net>;
> To:  <openstack-dev@lists.openstack.org>;
> Date: 2017/06/16 05:25
> Subject: Re: [openstack-dev] [all][qa][glance] some recent tempest problems
>
>
> On 06/15/2017 01:04 PM, Brian Rosmaita wrote:
>> This isn't a glance-specific problem though we've encountered it quite
>> a few times recently.
>>
>> Briefly, we're gating on Tempest jobs that tempest itself does not
>> gate on.  This leads to a situation where new tests can be merged in
>> tempest, but wind up breaking our gate. We aren't claiming that the
>> added tests are bad or don't provide value; the problem is that we
>> have to drop everything and fix the gate.  This interrupts our current
>> work and forces us to prioritize bugs to fix based not on what makes
>> the most sense for the project given current priorities and resources,
>> but based on whatever we can do to get the gates un-blocked.
>>
>> As we said earlier, this situation seems to be impacting multiple
>> projects.
>>
>> One solution for this is to change our gating so that we do not run
>> any Tempest jobs against Glance repositories that are not also gated
>> by Tempest.  That would in theory open a regression path, which is why
>> we haven't put up a patch yet.  Another way this could be addressed is
>> by the Tempest team changing the non-voting jobs causing this
>> situation into voting jobs, which would prevent such changes from
>> being merged in the first place.  The key issue here is that we need
>> to be able to prioritize bugs based on what's most important to each
>> project.
>>
>> We want to be clear that we appreciate the work the Tempest team does.
>> We abhor bugs and want to squash them too.  The problem is just that
>> we're stretched pretty thin with resources right now, and being forced
>> to prioritize bug fixes that will get our gate un-blocked is
>> interfering with our ability to work on issues that may have a higher
>> impact on end users.
>>
>> The point of this email is to find out whether anyone has a better
>> suggestion for how to handle this situation.
>
> It would be useful to provide detailed examples. Everything is trade
> offs, and having the conversation in the abstract is very difficult to
> understand those trade offs.
>
> -Sean
>
> --
> Sean Dague
> http://dague.net
>


..1 http://lists.openstack.org/pipermail/openstack-dev/2017-May/116172.html

..2 https://review.openstack.org/#/c/459774/ ,
https://review.openstack.org/#/c/459445/


-gmann



Re: [openstack-dev] [all][qa][glance] some recent tempest problems

2017-06-15 Thread zhu.fanglei
https://review.openstack.org/#/c/471352/   may be an example






Original Mail

Sender:  <s...@dague.net>
To:  <openstack-dev@lists.openstack.org>
Date: 2017/06/16 05:25
Subject: Re: [openstack-dev] [all][qa][glance] some recent tempest problems


On 06/15/2017 01:04 PM, Brian Rosmaita wrote:
> This isn't a glance-specific problem though we've encountered it quite
> a few times recently.
> 
> Briefly, we're gating on Tempest jobs that tempest itself does not
> gate on.  This leads to a situation where new tests can be merged in
> tempest, but wind up breaking our gate. We aren't claiming that the
> added tests are bad or don't provide value; the problem is that we
> have to drop everything and fix the gate.  This interrupts our current
> work and forces us to prioritize bugs to fix based not on what makes
> the most sense for the project given current priorities and resources,
> but based on whatever we can do to get the gates un-blocked.
> 
> As we said earlier, this situation seems to be impacting multiple projects.
> 
> One solution for this is to change our gating so that we do not run
> any Tempest jobs against Glance repositories that are not also gated
> by Tempest.  That would in theory open a regression path, which is why
> we haven't put up a patch yet.  Another way this could be addressed is
> by the Tempest team changing the non-voting jobs causing this
> situation into voting jobs, which would prevent such changes from
> being merged in the first place.  The key issue here is that we need
> to be able to prioritize bugs based on what's most important to each
> project.
> 
> We want to be clear that we appreciate the work the Tempest team does.
> We abhor bugs and want to squash them too.  The problem is just that
> we're stretched pretty thin with resources right now, and being forced
> to prioritize bug fixes that will get our gate un-blocked is
> interfering with our ability to work on issues that may have a higher
> impact on end users.
> 
> The point of this email is to find out whether anyone has a better
> suggestion for how to handle this situation.

It would be useful to provide detailed examples. Everything is trade
offs, and having the conversation in the abstract is very difficult to
understand those trade offs.

-Sean

-- 
Sean Dague
http://dague.net



Re: [openstack-dev] [all][qa][glance] some recent tempest problems

2017-06-15 Thread Sean Dague
On 06/15/2017 01:04 PM, Brian Rosmaita wrote:
> This isn't a glance-specific problem though we've encountered it quite
> a few times recently.
> 
> Briefly, we're gating on Tempest jobs that tempest itself does not
> gate on.  This leads to a situation where new tests can be merged in
> tempest, but wind up breaking our gate. We aren't claiming that the
> added tests are bad or don't provide value; the problem is that we
> have to drop everything and fix the gate.  This interrupts our current
> work and forces us to prioritize bugs to fix based not on what makes
> the most sense for the project given current priorities and resources,
> but based on whatever we can do to get the gates un-blocked.
> 
> As we said earlier, this situation seems to be impacting multiple projects.
> 
> One solution for this is to change our gating so that we do not run
> any Tempest jobs against Glance repositories that are not also gated
> by Tempest.  That would in theory open a regression path, which is why
> we haven't put up a patch yet.  Another way this could be addressed is
> by the Tempest team changing the non-voting jobs causing this
> situation into voting jobs, which would prevent such changes from
> being merged in the first place.  The key issue here is that we need
> to be able to prioritize bugs based on what's most important to each
> project.
> 
> We want to be clear that we appreciate the work the Tempest team does.
> We abhor bugs and want to squash them too.  The problem is just that
> we're stretched pretty thin with resources right now, and being forced
> to prioritize bug fixes that will get our gate un-blocked is
> interfering with our ability to work on issues that may have a higher
> impact on end users.
> 
> The point of this email is to find out whether anyone has a better
> suggestion for how to handle this situation.

It would be useful to provide detailed examples. Everything is trade
offs, and having the conversation in the abstract is very difficult to
understand those trade offs.

-Sean

-- 
Sean Dague
http://dague.net



Re: [openstack-dev] [all][qa][glance] some recent tempest problems

2017-06-15 Thread Doug Hellmann
Excerpts from Brian Rosmaita's message of 2017-06-15 13:04:39 -0400:
> This isn't a glance-specific problem though we've encountered it quite
> a few times recently.
> 
> Briefly, we're gating on Tempest jobs that tempest itself does not
> gate on.  This leads to a situation where new tests can be merged in
> tempest, but wind up breaking our gate. We aren't claiming that the
> added tests are bad or don't provide value; the problem is that we
> have to drop everything and fix the gate.  This interrupts our current
> work and forces us to prioritize bugs to fix based not on what makes
> the most sense for the project given current priorities and resources,
> but based on whatever we can do to get the gates un-blocked.
> 
> As we said earlier, this situation seems to be impacting multiple projects.
> 
> One solution for this is to change our gating so that we do not run
> any Tempest jobs against Glance repositories that are not also gated
> by Tempest.  That would in theory open a regression path, which is why
> we haven't put up a patch yet.  Another way this could be addressed is
> by the Tempest team changing the non-voting jobs causing this
> situation into voting jobs, which would prevent such changes from
> being merged in the first place.  The key issue here is that we need
> to be able to prioritize bugs based on what's most important to each
> project.
> 
> We want to be clear that we appreciate the work the Tempest team does.
> We abhor bugs and want to squash them too.  The problem is just that
> we're stretched pretty thin with resources right now, and being forced
> to prioritize bug fixes that will get our gate un-blocked is
> interfering with our ability to work on issues that may have a higher
> impact on end users.
> 
> The point of this email is to find out whether anyone has a better
> suggestion for how to handle this situation.
> 
> Thanks!
> 
> Erno Kuvaja
> Glance Release Czar
> 
> Brian Rosmaita
> Glance PTL
> 

Asymmetric gating definitely has a way of introducing these problems.

Which jobs are involved?

Doug
