Re: [openstack-dev] [heat] convergence cancel messages

2016-08-19 Thread Zane Bitter

On 19/08/16 09:55, Anant Patil wrote:


What I'm suggesting is very close to that:

(1) stack-cancel-update <stack> will start another update using the
previous template/environment. We'll start rolling back; in-progress
resources will be allowed to complete normally.
(2) stack-cancel-update <stack> --no-rollback will set the
traversal_id to None so no further resources will be updated;
in-progress resources will be allowed to complete normally.
(3) resource-mark-unhealthy <stack> <resource> ...
Kill any threads running a CREATE or UPDATE on the given resources, mark
as CHECK_FAILED if they are not already in UPDATE_FAILED, don't do
anything else. If the resource was in progress, the stack won't progress
further, other resources currently in-progress will complete, and if
rollback is enabled and no other traversal has started then it will roll
back to the previous template/environment.

I have started implementation of the above three mechanisms. The first
two are implemented in https://review.openstack.org/#/c/357618


This looks great, thanks! That covers both our internal use of 
update-cancel and the current user API update-cancel nicely.



Note that (2) needs a change in the heat client (openstack client?) to
add a --no-rollback option.


Yeah, and also a (very minor) REST API change. I'd be in favour of 
trying to get this in before Newton FF, it'd be really useful to have.
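
For illustration, a minimal sketch of the two behaviours in (1) and (2)
above; db_api, start_update and prev_raw_template_id are hypothetical
stand-ins, not Heat's actual internals:

    def cancel_update(context, stack, rollback=True):
        # Hypothetical helpers: db_api, start_update, prev_raw_template_id.
        if rollback:
            # (1) Cancel by starting a new update using the previous
            # template/environment; in-progress resources are left alone
            # and allowed to complete normally.
            prev_tmpl = db_api.raw_template_get(context,
                                                stack.prev_raw_template_id)
            start_update(context, stack, prev_tmpl)
        else:
            # (2) --no-rollback: clear the current traversal so no further
            # resources are scheduled; in-progress resources still finish,
            # and the stack simply stops where it is.
            db_api.stack_update(context, stack.id,
                                {'current_traversal': None})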



(3) is a bit of a long haul, and needs:
https://review.openstack.org/343076 : Adds mechanism to interrupt
convergence worker threads
https://review.openstack.org/301483 : Mechanism to send cancel message
and cancel worker upon receiving messages


Another thing I forgot is that when we delete a stack, we used to cancel 
all the threads working on it, so that any in-progress update/create was 
stopped (you're about to delete that stuff anyway, so you might as well 
not bother with anything else). The lack of this functionality in 
convergence is causing problems for some users. It looks like this patch 
is intended to build on the previous two to resolve that:


https://review.openstack.org/#/c/354000/

(This is actually going to be much better than the old behaviour, 
because it turned out that cancelling threads was very much not the 
right thing to do, and it's much better to stop them at a yield point.)
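
As a rough sketch of what "stopping at a yield point" means in practice
(cooperative cancellation between steps rather than killing the
greenthread outright; names are illustrative, not the actual Heat
scheduler code):

    import eventlet

    class CancelledError(Exception):
        """Raised when a task notices it has been asked to stop."""

    def run_task(task_steps, cancel_event):
        # task_steps is a generator; each yield is a safe point where the
        # resource is in a known state. cancel_event is something like an
        # eventlet.event.Event that the cancel handler sends to.
        for _ in task_steps:
            if cancel_event.ready():
                # Stop here instead of killing the greenthread
                # mid-operation, so the resource is never left
                # half-modified.
                raise CancelledError()
            eventlet.sleep(1)  # the "yield point" between steps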


So I think all of the above apart from the API/client change for (2) are 
going to be critical to land for Newton. (They're all in a sense bugs at 
the moment.)



Apart from the above two, I am implementing the actual patch, which will
leverage them to complete the resource-mark-unhealthy feature in
convergence.


Great! Hopefully people will rarely need this, but it'll be much more 
comfortable unleashing convergence on the world if we know that this 
exists as a circuit-breaker in case something does get stuck.


Let me know if I can help with any of this stuff without stepping on any 
toes (time zones unfortunately make it hard for you and me to 
co-ordinate). I'll at least try to circle back regularly to the reviews.


cheers,
Zane.



Re: [openstack-dev] [heat] convergence cancel messages

2016-08-19 Thread Anant Patil
On Tue, Apr 19, 2016 at 9:36 PM Zane Bitter wrote:

> On 17/04/16 00:44, Anant Patil wrote:
> > I think it is a good idea, but I see that a resource can be
> marked
> > unhealthy only after it is done.
> >
> >
> > Currently, yes. The idea would be to change that so that if it finds
> > the resource IN_PROGRESS then it kills the thread and makes sure the
> > resource is in a FAILED state. I
> >
> >
> > Move the resource to CHECK_FAILED?
>
> I'd say that if killing the thread gets it to UPDATE_FAILED then Mission
> Accomplished, but obviously we'd have to check for races and make sure
> we move it to CHECK_FAILED if the update completes successfully.
>
> > The trick would be if the stack update is still running and the
> > resource is currently IN_PROGRESS to make sure that we fail the
> > whole stack update (rolling back if the user has enabled that).
> >
> >
> > IMO, we can probably use the cancel  command do this, because when you
> > are marking a resource as unhealthy, you are
> > cancelling any action running on that resource. Would the following be
> ok?
> > (1) stack-cancel-update  will cancel the update, mark
> > cancelled resources failed and rollback (existing stuff)
> > (2) stack-cancel-update  --no-rollback will just cancel the
> > update and mark cancelled resources as failed
> > (3) stack-cancel-update   ...  Just
> > stop the action on given resources, mark as CHECK_FAILED, don't do
> > anything else. The stack won't progress further. Other resources running
> > while cancel-update will complete.
>
> None of those solve the use case I actually care about, which is "don't
> start any more resource updates, but don't mark the ones currently
> in-progress as failed either, and don't roll back". That would be a huge
> help in TripleO. We need a way to be able to stop updates that
> guarantees not unnecessarily destroying any part of the existing stack,
> and we need that to be the default.
>
> (We sort-of have the rollback version of this; it's equivalent to a
> stack update with the previous template/environment. But we need to make
> it easier and decouple it from the rollback IMHO.)
>
> So one way to do this would be:
>
> (1) stack-cancel-update  will start another update using the
> previous template/environment. We'll start rolling back; in-progress
> resources will be allowed to complete normally.
> (2) stack-cancel-update  --no-rollback will set the
> traversal_id to None so no further resources will be updated;
> in-progress resources will be allowed to complete normally.
> (3) stack-cancel-update  --stop-in-progress will stop the
> traversal, kill any running threads update (marking cancelled resources
> failed) and rollback
> (4) stack-cancel-update  --stop-in-progress --no-rollback will
> just stop the traversal, kill any running threads update (marking
> cancelled resources failed)
> (5) stack-cancel-update  --stop-in-progress  ...
>  Just stop the action on given resources, mark as
> UPDATE_FAILED, don't do anything else. The stack won't progress further.
> Other resources running while cancel-update will complete.
>
> That would cover all the use cases. Some problems with it are:
> - It's way complicated. Lots of options.
> - Those options don't translate well to legacy (pre-convergence) stacks
> using the same client. e.g. there is now a non-default
> --stop-in-progress option, but on legacy stacks we always stop in-progress.
> - Options don't commute. When you specify resources with the
> --stop-in-progress flag it never rolls back, even though you haven't set
> the --no-rollback flag.
>
> An alternative would be to just drop (3) and (4), and maybe rename (5).
> I'd be OK with that:
>
> (1) stack-cancel-update  will start another update using the
> previous template/environment. We'll start rolling back; in-progress
> resources will be allowed to complete normally.
> (2) stack-cancel-update  --no-rollback will set the
> traversal_id to None so no further resources will be updated;
> in-progress resources will be allowed to complete normally.
> (3) resource-stop-update   ...  Just
> stop the action on given resources, mark as UPDATE_FAILED, don't do
> anything else. The stack won't progress further. Other resources running
> while cancel-update will complete.
>
> That solves most of the issues, except that (3) has no real equivalent
> on legacy stacks (I guess we could just make it fail on the server side).
>
> What I'm suggesting is very close to that:
>
> (1) stack-cancel-update  will start another update using the
> previous template/environment. We'll start rolling back; in-progress
> resources will be allowed to complete normally.
> (2) stack-cancel-update  --no-rollback will set the
> traversal_id to None so no further resources will be updated;
> in-progress resources will be allowed to complete normally.
> (3) resource-mark-unhealthy   ... 
> Kill any threads running a CREATE or UPDATE on the given resources, 

Re: [openstack-dev] [heat] convergence cancel messages

2016-04-19 Thread Zane Bitter

On 17/04/16 00:44, Anant Patil wrote:

I think it is a good idea, but I see that a resource can be marked
unhealthy only after it is done.


Currently, yes. The idea would be to change that so that if it finds
the resource IN_PROGRESS then it kills the thread and makes sure the
resource is in a FAILED state. I


Move the resource to CHECK_FAILED?


I'd say that if killing the thread gets it to UPDATE_FAILED then Mission 
Accomplished, but obviously we'd have to check for races and make sure 
we move it to CHECK_FAILED if the update completes successfully.
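
To make that race concrete, a rough sketch of the check that would be
needed; kill_task, refresh and set_state here are hypothetical
stand-ins, not the real Resource API:

    def stop_in_progress_resource(resource):
        resource.kill_task()   # hypothetical: ask the worker thread to stop

        # Re-read the state afterwards: the operation may have completed
        # anyway before the thread noticed the cancellation.
        resource.refresh()
        if resource.status == 'IN_PROGRESS':
            # Genuinely interrupted: mark it failed so it gets replaced
            # on the next update.
            resource.set_state('UPDATE', 'FAILED')
        elif resource.status == 'COMPLETE':
            # Lost the race: the update finished successfully, so record
            # the user's intent as CHECK_FAILED instead.
            resource.set_state('CHECK', 'FAILED')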



The trick would be if the stack update is still running and the
resource is currently IN_PROGRESS to make sure that we fail the
whole stack update (rolling back if the user has enabled that).


IMO, we can probably use the cancel command to do this, because when you
are marking a resource as unhealthy, you are
cancelling any action running on that resource. Would the following be ok?
(1) stack-cancel-update <stack> will cancel the update, mark
cancelled resources failed and roll back (existing stuff)
(2) stack-cancel-update <stack> --no-rollback will just cancel the
update and mark cancelled resources as failed
(3) stack-cancel-update <stack> <resource> ... Just
stop the action on the given resources, mark them as CHECK_FAILED, and don't do
anything else. The stack won't progress further. Other resources running
when cancel-update is issued will complete.


None of those solve the use case I actually care about, which is "don't 
start any more resource updates, but don't mark the ones currently 
in-progress as failed either, and don't roll back". That would be a huge 
help in TripleO. We need a way to stop updates that guarantees we won't 
unnecessarily destroy any part of the existing stack, 
and we need that to be the default.


(We sort-of have the rollback version of this; it's equivalent to a 
stack update with the previous template/environment. But we need to make 
it easier and decouple it from the rollback IMHO.)


So one way to do this would be:

(1) stack-cancel-update <stack> will start another update using the 
previous template/environment. We'll start rolling back; in-progress 
resources will be allowed to complete normally.
(2) stack-cancel-update <stack> --no-rollback will set the 
traversal_id to None so no further resources will be updated; 
in-progress resources will be allowed to complete normally.
(3) stack-cancel-update <stack> --stop-in-progress will stop the 
traversal, kill any threads running an update (marking cancelled 
resources failed) and roll back
(4) stack-cancel-update <stack> --stop-in-progress --no-rollback will 
just stop the traversal and kill any threads running an update (marking 
cancelled resources failed)
(5) stack-cancel-update <stack> --stop-in-progress <resource> ... 
Just stop the action on the given resources, mark them as 
UPDATE_FAILED, and don't do anything else. The stack won't progress further. 
Other resources running when cancel-update is issued will complete.


That would cover all the use cases. Some problems with it are:
- It's way complicated. Lots of options.
- Those options don't translate well to legacy (pre-convergence) stacks 
using the same client. e.g. there is now a non-default 
--stop-in-progress option, but on legacy stacks we always stop in-progress.
- Options don't commute. When you specify resources with the 
--stop-in-progress flag it never rolls back, even though you haven't set 
the --no-rollback flag.


An alternative would be to just drop (3) and (4), and maybe rename (5). 
I'd be OK with that:


(1) stack-cancel-update <stack> will start another update using the 
previous template/environment. We'll start rolling back; in-progress 
resources will be allowed to complete normally.
(2) stack-cancel-update <stack> --no-rollback will set the 
traversal_id to None so no further resources will be updated; 
in-progress resources will be allowed to complete normally.
(3) resource-stop-update <stack> <resource> ... Just 
stop the action on the given resources, mark them as UPDATE_FAILED, and don't do 
anything else. The stack won't progress further. Other resources running 
when cancel-update is issued will complete.


That solves most of the issues, except that (3) has no real equivalent 
on legacy stacks (I guess we could just make it fail on the server side).


What I'm suggesting is very close to that:

(1) stack-cancel-update <stack> will start another update using the 
previous template/environment. We'll start rolling back; in-progress 
resources will be allowed to complete normally.
(2) stack-cancel-update <stack> --no-rollback will set the 
traversal_id to None so no further resources will be updated; 
in-progress resources will be allowed to complete normally.
(3) resource-mark-unhealthy <stack> <resource> ...
Kill any threads running a CREATE or UPDATE on the given resources, mark 
as CHECK_FAILED if they are not already in UPDATE_FAILED, don't do 
anything else. If the resource was in progress, the stack won't progress 
further, other resources currently in-progress will complete, and if 
rollback is enabled and no other traversal 

Re: [openstack-dev] [heat] convergence cancel messages

2016-04-16 Thread Anant Patil
On Sat, Apr 16, 2016 at 1:18 AM, Zane Bitter wrote:

> On 15/04/16 10:58, Anant Patil wrote:
>
>> On 14-Apr-16 23:09, Zane Bitter wrote:
>>
>>> On 11/04/16 04:51, Anant Patil wrote:
>>>
 After lot of ping-pong in my head, I have taken a different approach to
 implement stack-update-cancel when convergence is on. Polling for
 traversal update in each heat engine worker is not efficient method and
 so is the broadcasting method.

 In the new implementation, when a stack-cancel-update request is
 received, the heat engine worker will immediately cancel eventlets
 running locally for the stack. Then it sends cancel messages to only
 those heat engines who are working on the stack, one request per engine.

>>>
>>> I'm concerned that this is forgetting the reason we didn't implement
>>> this in convergence in the first place. The purpose of
>>> stack-cancel-update is to roll the stack back to its pre-update state,
>>> not to unwedge blocked resources.
>>>
>>>
>> Yes, we thought this was never needed because we consciously decided
>> that the concurrent update feature would suffice the needs of user.
>> Exactly the reason for me to implement this so late. But there were
>> questions for API compatibility, and what if user really wants to cancel
>> the update, given that he/she knows the consequence of it?
>>
>
> Cool, we are on the same page then :)
>
> The problem with just killing a thread is that the resource gets left in
>>> an unknown state. (It's slightly less dangerous if you do it only during
>>> sleeps, but still the state is indeterminate.) As a result, we mark all
>>> such resources UPDATE_FAILED, and anything (apart from nested stacks) in
>>> a FAILED state is liable to be replaced on the next update (straight
>>> away in the case of a rollback). That's why in convergence we just let
>>> resources run their course rather than cancelling them, and of course we
>>> are able to do so because they don't block other operations on the stack
>>> until they reach the point of needing to operate on that particular
>>> resource.
>>>
>>>
>> The eventlet returns after each "step", so it's not that bad, but I do
>>
>
> Yeah, I saw you implemented it that way, and this is a *big* improvement.
> That will help avoid bugs like
> http://lists.openstack.org/pipermail/openstack-dev/2016-January/084467.html
>
> agree that the resource might not be in a state from where it can
>> "resume", and hence the update-replace.
>>
>
> The issue is that Heat *always* moves the resource to FAILED and therefore
> it is *always* replaced in the future, even if it would have completed fine.
>
> So doing some trivial change that is guaranteed to happen in-place could
> result in your critical resource that must never be replaced (e.g. Cinder
> volume) being replaced if you happen to cancel the update at just the wrong
> moment.


I agree with you on the need for a mechanism to just stop doing the
update (or whatever heat was doing to that resource :))

>
>
>> I acknowledge your concern here,
>> but I see that the user really knows that the stack is stuck because of
>> some unexpected failure which heat is not aware of, and wants to cancel
>> it.
>>
>
> I think there's two different use cases here: (1) just stop the update and
> don't start updating any more resources (and maybe roll back what has
> already been done); and (2) kill the update on this resource that is stuck.
> Using the same command for both is likely to cause trouble for people who
> were only wanting the first one.
>
> The other option would be to have stack-cancel-update just do (1) by
> default, but add a --cancel-me-harder option that also does (2).


>
> That leaves the problem of what to do when you _know_ a resource is
>>> going to fail, you _want_ to replace it, and you don't want to wait for
>>> the stack timeout. (In theory this problem will go away when Phase 2 of
>>> convergence is fully implemented, but I agree we need a solution for
>>> Phase 1.) Now that we have the mark-unhealthy API,[1] that seems to me
>>> like a better candidate for the functionality to stop threads than
>>> stack-cancel-update is, since its entire purpose in life is to set a
>>> resource into a FAILED state so that it will get replaced on the next
>>> stack update.
>>>
>>> So from a user's perspective, they would issue stack-cancel-update to
>>> start the rollback, and iff that gets stuck waiting on a resource that
>>> is doomed to fail eventually and which they just want to replace, they
>>> can issue resource-mark-unhealthy to just stop that resource.
>>>
>>>
>> I was thinking of having the rollback optional while cancelling the
>> update. The user may want to cancel the update and issue a new one, but
>> not rollback.
>>
>
> +1, this is a good idea. I originally thought that you'd never want to
> leave the stack in an intermediate state, but experience with TripleO
> (which can't really do rollbacks) is that sometimes 

Re: [openstack-dev] [heat] convergence cancel messages

2016-04-15 Thread Zane Bitter

On 15/04/16 10:58, Anant Patil wrote:

On 14-Apr-16 23:09, Zane Bitter wrote:

On 11/04/16 04:51, Anant Patil wrote:

After lot of ping-pong in my head, I have taken a different approach to
implement stack-update-cancel when convergence is on. Polling for
traversal update in each heat engine worker is not efficient method and
so is the broadcasting method.

In the new implementation, when a stack-cancel-update request is
received, the heat engine worker will immediately cancel eventlets
running locally for the stack. Then it sends cancel messages to only
those heat engines who are working on the stack, one request per engine.


I'm concerned that this is forgetting the reason we didn't implement
this in convergence in the first place. The purpose of
stack-cancel-update is to roll the stack back to its pre-update state,
not to unwedge blocked resources.



Yes, we thought this was never needed because we consciously decided
that the concurrent update feature would suffice for the user's needs.
That's exactly the reason I am implementing this so late. But there were
questions about API compatibility, and what if the user really wants to cancel
the update, given that he/she knows the consequences of it?


Cool, we are on the same page then :)


The problem with just killing a thread is that the resource gets left in
an unknown state. (It's slightly less dangerous if you do it only during
sleeps, but still the state is indeterminate.) As a result, we mark all
such resources UPDATE_FAILED, and anything (apart from nested stacks) in
a FAILED state is liable to be replaced on the next update (straight
away in the case of a rollback). That's why in convergence we just let
resources run their course rather than cancelling them, and of course we
are able to do so because they don't block other operations on the stack
until they reach the point of needing to operate on that particular
resource.



The eventlet returns after each "step", so it's not that bad, but I do


Yeah, I saw you implemented it that way, and this is a *big* 
improvement. That will help avoid bugs like 
http://lists.openstack.org/pipermail/openstack-dev/2016-January/084467.html



agree that the resource might not be in a state from where it can
"resume", and hence the update-replace.


The issue is that Heat *always* moves the resource to FAILED and 
therefore it is *always* replaced in the future, even if it would have 
completed fine.


So doing some trivial change that is guaranteed to happen in-place could 
result in your critical resource that must never be replaced (e.g. 
Cinder volume) being replaced if you happen to cancel the update at just 
the wrong moment.



I acknowledge your concern here,
but I see that the user really knows that the stack is stuck because of
some unexpected failure which heat is not aware of, and wants to cancel
it.


I think there are two different use cases here: (1) just stop the update 
and don't start updating any more resources (and maybe roll back what 
has already been done); and (2) kill the update on this resource that is 
stuck. Using the same command for both is likely to cause trouble for 
people who were only wanting the first one.


The other option would be to have stack-cancel-update just do (1) by 
default, but add a --cancel-me-harder option that also does (2).



That leaves the problem of what to do when you _know_ a resource is
going to fail, you _want_ to replace it, and you don't want to wait for
the stack timeout. (In theory this problem will go away when Phase 2 of
convergence is fully implemented, but I agree we need a solution for
Phase 1.) Now that we have the mark-unhealthy API,[1] that seems to me
like a better candidate for the functionality to stop threads than
stack-cancel-update is, since its entire purpose in life is to set a
resource into a FAILED state so that it will get replaced on the next
stack update.

So from a user's perspective, they would issue stack-cancel-update to
start the rollback, and iff that gets stuck waiting on a resource that
is doomed to fail eventually and which they just want to replace, they
can issue resource-mark-unhealthy to just stop that resource.



I was thinking of having the rollback optional while cancelling the
update. The user may want to cancel the update and issue a new one, but
not rollback.


+1, this is a good idea. I originally thought that you'd never want to 
leave the stack in an intermediate state, but experience with TripleO 
(which can't really do rollbacks) is that sometimes you really do just 
want to hit the panic button and stop the world :D



What do you think?



I think it is a good idea, but I see that a resource can be marked
unhealthy only after it is done.


Currently, yes. The idea would be to change that so that if it finds the 
resource IN_PROGRESS then it kills the thread and makes sure the 
resource is in a FAILED state. I imagine/hope it wouldn't require big 
changes to your patch, mostly just changing where it's triggered 

Re: [openstack-dev] [heat] convergence cancel messages

2016-04-15 Thread Anant Patil
On 14-Apr-16 23:09, Zane Bitter wrote:
> On 11/04/16 04:51, Anant Patil wrote:
>> On 14-Mar-16 14:40, Anant Patil wrote:
>>> On 24-Feb-16 22:48, Clint Byrum wrote:
 Excerpts from Anant Patil's message of 2016-02-23 23:08:31 -0800:
> Hi,
>
> I would like the discuss various approaches towards fixing bug
> https://launchpad.net/bugs/1533176
>
> When convergence is on, and if the stack is stuck, there is no way to
> cancel the existing request. This feature was not implemented in
> convergence, as the user can again issue an update on an in-progress
> stack. But if a resource worker is stuck, the new update will wait
> for-ever on it and the update will not be effective.
>
> The solution is to implement cancel request. Since the work for a stack
> is distributed among heat engines, the cancel request will not work as
> it does in legacy way. Many or all of the heat engines might be running
> worker threads to provision a stack.
>
> I could think of two options which I would like to discuss:
>
> (a) When a user triggered cancel request is received, set the stack
> current traversal to None or something else other than current
> traversal. With this the new check-resources/workers will never be
> triggered. This is okay as long as the worker(s) is not stuck. The
> existing workers will finish running, and no new check-resource
> (workers) will be triggered, and it will be a graceful cancel.  But the
> workers that are stuck will be stuck for-ever till stack times-out.  To
> take care of such cases, we will have to implement logic of "polling"
> the DB at regular intervals (may be at each step() of scheduler task)
> and bail out if the current traversal is updated. Basically, each worker
> will "poll" the DB to see if the current traversal is still valid and if
> not, stop itself. The drawback of this approach is that all the workers
> will be hitting the DB and incur a significant overhead.  Besides, all
> the stack workers irrespective of whether they will be cancelled or not,
> will keep on hitting DB. The advantage is that it probably is easier to
> implement. Also, if the worker is stuck in particular "step", then this
> approach will not work.
>
> (b) Another approach is to send cancel message to all the heat engines
> when one receives a stack cancel request. The idea is to use the thread
> group manager in each engine to keep track of threads running for a
> stack, and stop the thread group when a cancel message is received. The
> advantage is that the messages to cancel stack workers is sent only when
> required and there is no other over-head. The draw-back is that the
> cancel message is 'broadcasted' to all heat engines, even if they are
> not running any workers for the given stack, though, in such cases, it
> will be a just no-op for the heat-engine (the message will be gracefully
> discarded).
 Oh hah, I just sent (b) as an option to avoid (a) without really
 thinking about (b) again.

 I don't think the cancel broadcasts are all that much of a drawback. I
 do think you need to rate limit cancels though, or you give users the
 chance to DDoS the system.
>>> There is no easier way to restrict the cancels, so I am choosing the
>>> option of having a "monitoring task" which runs in separate thread. This
>>> task periodically polls DB to check if the current traversal is updated.
>>> When a cancel message is received, the current traversal is updated to
>>> new id and monitoring task will stop the thread group running worker
>>> threads for previous traversal (traversal uniquely identifies a stack
>>> operation).
>>>
>>> Also, this will help with checking timeout. Currently each worker checks
>>> for timeout.  I can move this to the monitoring thread which will stop
>>> the thread group when stack times out.
>>>
>>> It is better to restrict the actions within the heat engine than to load
>>> the AMQP; that can lead to potentially complicated issues.
>>>
>>> -- Anant
>> I almost forgot to update this thread.
>>
>> After lot of ping-pong in my head, I have taken a different approach to
>> implement stack-update-cancel when convergence is on. Polling for
>> traversal update in each heat engine worker is not efficient method and
>> so is the broadcasting method.
>>
>> In the new implementation, when a stack-cancel-update request is
>> received, the heat engine worker will immediately cancel eventlets
>> running locally for the stack. Then it sends cancel messages to only
>> those heat engines who are working on the stack, one request per engine.
> 
> I'm concerned that this is forgetting the reason we didn't implement 
> this in convergence in the first place. The purpose of 
> stack-cancel-update is to roll the stack back to its pre-update state, 
> not to unwedge blocked resources.
> 

Yes, we 

Re: [openstack-dev] [heat] convergence cancel messages

2016-04-14 Thread Zane Bitter

On 11/04/16 04:51, Anant Patil wrote:

On 14-Mar-16 14:40, Anant Patil wrote:

On 24-Feb-16 22:48, Clint Byrum wrote:

Excerpts from Anant Patil's message of 2016-02-23 23:08:31 -0800:

Hi,

I would like the discuss various approaches towards fixing bug
https://launchpad.net/bugs/1533176

When convergence is on, and if the stack is stuck, there is no way to
cancel the existing request. This feature was not implemented in
convergence, as the user can again issue an update on an in-progress
stack. But if a resource worker is stuck, the new update will wait
for-ever on it and the update will not be effective.

The solution is to implement cancel request. Since the work for a stack
is distributed among heat engines, the cancel request will not work as
it does in legacy way. Many or all of the heat engines might be running
worker threads to provision a stack.

I could think of two options which I would like to discuss:

(a) When a user triggered cancel request is received, set the stack
current traversal to None or something else other than current
traversal. With this the new check-resources/workers will never be
triggered. This is okay as long as the worker(s) is not stuck. The
existing workers will finish running, and no new check-resource
(workers) will be triggered, and it will be a graceful cancel.  But the
workers that are stuck will be stuck for-ever till stack times-out.  To
take care of such cases, we will have to implement logic of "polling"
the DB at regular intervals (may be at each step() of scheduler task)
and bail out if the current traversal is updated. Basically, each worker
will "poll" the DB to see if the current traversal is still valid and if
not, stop itself. The drawback of this approach is that all the workers
will be hitting the DB and incur a significant overhead.  Besides, all
the stack workers irrespective of whether they will be cancelled or not,
will keep on hitting DB. The advantage is that it probably is easier to
implement. Also, if the worker is stuck in particular "step", then this
approach will not work.

(b) Another approach is to send cancel message to all the heat engines
when one receives a stack cancel request. The idea is to use the thread
group manager in each engine to keep track of threads running for a
stack, and stop the thread group when a cancel message is received. The
advantage is that the messages to cancel stack workers is sent only when
required and there is no other over-head. The draw-back is that the
cancel message is 'broadcasted' to all heat engines, even if they are
not running any workers for the given stack, though, in such cases, it
will be a just no-op for the heat-engine (the message will be gracefully
discarded).

Oh hah, I just sent (b) as an option to avoid (a) without really
thinking about (b) again.

I don't think the cancel broadcasts are all that much of a drawback. I
do think you need to rate limit cancels though, or you give users the
chance to DDoS the system.

There is no easier way to restrict the cancels, so I am choosing the
option of having a "monitoring task" which runs in separate thread. This
task periodically polls DB to check if the current traversal is updated.
When a cancel message is received, the current traversal is updated to
new id and monitoring task will stop the thread group running worker
threads for previous traversal (traversal uniquely identifies a stack
operation).

Also, this will help with checking timeout. Currently each worker checks
for timeout.  I can move this to the monitoring thread which will stop
the thread group when stack times out.

It is better to restrict the actions within the heat engine than to load
the AMQP; that can lead to potentially complicated issues.

-- Anant

I almost forgot to update this thread.

After lot of ping-pong in my head, I have taken a different approach to
implement stack-update-cancel when convergence is on. Polling for
traversal update in each heat engine worker is not efficient method and
so is the broadcasting method.

In the new implementation, when a stack-cancel-update request is
received, the heat engine worker will immediately cancel eventlets
running locally for the stack. Then it sends cancel messages to only
those heat engines who are working on the stack, one request per engine.


I'm concerned that this is forgetting the reason we didn't implement 
this in convergence in the first place. The purpose of 
stack-cancel-update is to roll the stack back to its pre-update state, 
not to unwedge blocked resources.


The problem with just killing a thread is that the resource gets left in 
an unknown state. (It's slightly less dangerous if you do it only during 
sleeps, but still the state is indeterminate.) As a result, we mark all 
such resources UPDATE_FAILED, and anything (apart from nested stacks) in 
a FAILED state is liable to be replaced on the next update (straight 
away in the case of a rollback). That's why in convergence we just let 
resources run 

Re: [openstack-dev] [heat] convergence cancel messages

2016-04-11 Thread Anant Patil
On 14-Mar-16 14:40, Anant Patil wrote:
> On 24-Feb-16 22:48, Clint Byrum wrote:
>> Excerpts from Anant Patil's message of 2016-02-23 23:08:31 -0800:
>>> Hi,
>>>
>>> I would like the discuss various approaches towards fixing bug
>>> https://launchpad.net/bugs/1533176
>>>
>>> When convergence is on, and if the stack is stuck, there is no way to
>>> cancel the existing request. This feature was not implemented in
>>> convergence, as the user can again issue an update on an in-progress
>>> stack. But if a resource worker is stuck, the new update will wait
>>> for-ever on it and the update will not be effective.
>>>
>>> The solution is to implement cancel request. Since the work for a stack
>>> is distributed among heat engines, the cancel request will not work as
>>> it does in legacy way. Many or all of the heat engines might be running
>>> worker threads to provision a stack.
>>>
>>> I could think of two options which I would like to discuss:
>>>
>>> (a) When a user triggered cancel request is received, set the stack
>>> current traversal to None or something else other than current
>>> traversal. With this the new check-resources/workers will never be
>>> triggered. This is okay as long as the worker(s) is not stuck. The
>>> existing workers will finish running, and no new check-resource
>>> (workers) will be triggered, and it will be a graceful cancel.  But the
>>> workers that are stuck will be stuck for-ever till stack times-out.  To
>>> take care of such cases, we will have to implement logic of "polling"
>>> the DB at regular intervals (may be at each step() of scheduler task)
>>> and bail out if the current traversal is updated. Basically, each worker
>>> will "poll" the DB to see if the current traversal is still valid and if
>>> not, stop itself. The drawback of this approach is that all the workers
>>> will be hitting the DB and incur a significant overhead.  Besides, all
>>> the stack workers irrespective of whether they will be cancelled or not,
>>> will keep on hitting DB. The advantage is that it probably is easier to
>>> implement. Also, if the worker is stuck in particular "step", then this
>>> approach will not work.
>>>
>>> (b) Another approach is to send cancel message to all the heat engines
>>> when one receives a stack cancel request. The idea is to use the thread
>>> group manager in each engine to keep track of threads running for a
>>> stack, and stop the thread group when a cancel message is received. The
>>> advantage is that the messages to cancel stack workers is sent only when
>>> required and there is no other over-head. The draw-back is that the
>>> cancel message is 'broadcasted' to all heat engines, even if they are
>>> not running any workers for the given stack, though, in such cases, it
>>> will be a just no-op for the heat-engine (the message will be gracefully
>>> discarded).
>> Oh hah, I just sent (b) as an option to avoid (a) without really
>> thinking about (b) again.
>>
>> I don't think the cancel broadcasts are all that much of a drawback. I
>> do think you need to rate limit cancels though, or you give users the
>> chance to DDoS the system.
> There is no easier way to restrict the cancels, so I am choosing the
> option of having a "monitoring task" which runs in separate thread. This
> task periodically polls DB to check if the current traversal is updated.
> When a cancel message is received, the current traversal is updated to
> new id and monitoring task will stop the thread group running worker
> threads for previous traversal (traversal uniquely identifies a stack
> operation).
>
> Also, this will help with checking timeout. Currently each worker checks
> for timeout.  I can move this to the monitoring thread which will stop
> the thread group when stack times out.
>
> It is better to restrict the actions within the heat engine than to load
> the AMQP; that can lead to potentially complicated issues.
>
> -- Anant
I almost forgot to update this thread.

After a lot of ping-pong in my head, I have taken a different approach to
implementing stack-update-cancel when convergence is on. Polling for
traversal updates in each heat engine worker is not an efficient method,
and neither is the broadcasting method.

In the new implementation, when a stack-cancel-update request is
received, the heat engine worker will immediately cancel the eventlets
running locally for the stack. Then it sends cancel messages to only
those heat engines that are working on the stack, one request per engine.

Please review the patch: https://review.openstack.org/#/c/301483/

-- Anant
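
(For illustration, a rough sketch of this targeted-cancel flow; db_api,
local_threads and rpc_client are hypothetical stand-ins for the real DB
and oslo.messaging plumbing:)

    def cancel_stack_workers(context, stack_id, local_threads, rpc_client):
        # Stop the eventlets this engine is running for the stack.
        local_threads.stop(stack_id)

        # Find the engines that currently hold IN_PROGRESS resources for
        # the stack and send one cancel message per engine, rather than
        # broadcasting to every engine in the deployment.
        engine_ids = set()
        for res in db_api.resource_get_all_by_stack(context, stack_id):
            if res.status == 'IN_PROGRESS' and res.engine_id:
                engine_ids.add(res.engine_id)
        for engine_id in engine_ids:
            rpc_client.cancel_stack(context, stack_id, engine_id=engine_id)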


Re: [openstack-dev] [heat] convergence cancel messages

2016-03-14 Thread Anant Patil
On 24-Feb-16 22:48, Clint Byrum wrote:
> Excerpts from Anant Patil's message of 2016-02-23 23:08:31 -0800:
>> Hi,
>>
>> I would like the discuss various approaches towards fixing bug
>> https://launchpad.net/bugs/1533176
>>
>> When convergence is on, and if the stack is stuck, there is no way to
>> cancel the existing request. This feature was not implemented in
>> convergence, as the user can again issue an update on an in-progress
>> stack. But if a resource worker is stuck, the new update will wait
>> for-ever on it and the update will not be effective.
>>
>> The solution is to implement cancel request. Since the work for a stack
>> is distributed among heat engines, the cancel request will not work as
>> it does in legacy way. Many or all of the heat engines might be running
>> worker threads to provision a stack.
>>
>> I could think of two options which I would like to discuss:
>>
>> (a) When a user triggered cancel request is received, set the stack
>> current traversal to None or something else other than current
>> traversal. With this the new check-resources/workers will never be
>> triggered. This is okay as long as the worker(s) is not stuck. The
>> existing workers will finish running, and no new check-resource
>> (workers) will be triggered, and it will be a graceful cancel.  But the
>> workers that are stuck will be stuck for-ever till stack times-out.  To
>> take care of such cases, we will have to implement logic of "polling"
>> the DB at regular intervals (may be at each step() of scheduler task)
>> and bail out if the current traversal is updated. Basically, each worker
>> will "poll" the DB to see if the current traversal is still valid and if
>> not, stop itself. The drawback of this approach is that all the workers
>> will be hitting the DB and incur a significant overhead.  Besides, all
>> the stack workers irrespective of whether they will be cancelled or not,
>> will keep on hitting DB. The advantage is that it probably is easier to
>> implement. Also, if the worker is stuck in particular "step", then this
>> approach will not work.
>>
>> (b) Another approach is to send cancel message to all the heat engines
>> when one receives a stack cancel request. The idea is to use the thread
>> group manager in each engine to keep track of threads running for a
>> stack, and stop the thread group when a cancel message is received. The
>> advantage is that the messages to cancel stack workers is sent only when
>> required and there is no other over-head. The draw-back is that the
>> cancel message is 'broadcasted' to all heat engines, even if they are
>> not running any workers for the given stack, though, in such cases, it
>> will be a just no-op for the heat-engine (the message will be gracefully
>> discarded).
> Oh hah, I just sent (b) as an option to avoid (a) without really
> thinking about (b) again.
>
> I don't think the cancel broadcasts are all that much of a drawback. I
> do think you need to rate limit cancels though, or you give users the
> chance to DDoS the system.
There is no easier way to restrict the cancels, so I am choosing the
option of having a "monitoring task" which runs in a separate thread. This
task periodically polls the DB to check if the current traversal has been updated.
When a cancel message is received, the current traversal is updated to a
new id and the monitoring task will stop the thread group running worker
threads for the previous traversal (the traversal uniquely identifies a stack
operation).

Also, this will help with checking timeouts. Currently each worker checks
for the timeout.  I can move this to the monitoring thread, which will stop
the thread group when the stack times out.

It is better to restrict the actions within the heat engine than to load
the AMQP; that can lead to potentially complicated issues.

-- Anant
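
(As a rough sketch of the monitoring task described above; db_api,
thread_group and POLL_INTERVAL are hypothetical names, not the actual
implementation:)

    import time

    POLL_INTERVAL = 5  # seconds; assumed polling period

    def monitor_stack(context, stack_id, traversal_id, timeout_at,
                      thread_group):
        while time.time() < timeout_at:
            stack = db_api.stack_get(context, stack_id)
            if stack.current_traversal != traversal_id:
                # A cancel (or a newer update) replaced the traversal:
                # stop all worker threads for the old traversal.
                thread_group.stop()
                return
            time.sleep(POLL_INTERVAL)
        # Stack timed out: the monitoring thread, not each worker, stops
        # the thread group.
        thread_group.stop()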



Re: [openstack-dev] [heat] convergence cancel messages

2016-02-24 Thread Clint Byrum
Excerpts from Anant Patil's message of 2016-02-23 23:08:31 -0800:
> Hi,
> 
> I would like the discuss various approaches towards fixing bug
> https://launchpad.net/bugs/1533176
> 
> When convergence is on, and if the stack is stuck, there is no way to
> cancel the existing request. This feature was not implemented in
> convergence, as the user can again issue an update on an in-progress
> stack. But if a resource worker is stuck, the new update will wait
> for-ever on it and the update will not be effective.
> 
> The solution is to implement cancel request. Since the work for a stack
> is distributed among heat engines, the cancel request will not work as
> it does in legacy way. Many or all of the heat engines might be running
> worker threads to provision a stack.
> 
> I could think of two options which I would like to discuss:
> 
> (a) When a user triggered cancel request is received, set the stack
> current traversal to None or something else other than current
> traversal. With this the new check-resources/workers will never be
> triggered. This is okay as long as the worker(s) is not stuck. The
> existing workers will finish running, and no new check-resource
> (workers) will be triggered, and it will be a graceful cancel.  But the
> workers that are stuck will be stuck for-ever till stack times-out.  To
> take care of such cases, we will have to implement logic of "polling"
> the DB at regular intervals (may be at each step() of scheduler task)
> and bail out if the current traversal is updated. Basically, each worker
> will "poll" the DB to see if the current traversal is still valid and if
> not, stop itself. The drawback of this approach is that all the workers
> will be hitting the DB and incur a significant overhead.  Besides, all
> the stack workers irrespective of whether they will be cancelled or not,
> will keep on hitting DB. The advantage is that it probably is easier to
> implement. Also, if the worker is stuck in particular "step", then this
> approach will not work.
> 
> (b) Another approach is to send cancel message to all the heat engines
> when one receives a stack cancel request. The idea is to use the thread
> group manager in each engine to keep track of threads running for a
> stack, and stop the thread group when a cancel message is received. The
> advantage is that the messages to cancel stack workers is sent only when
> required and there is no other over-head. The draw-back is that the
> cancel message is 'broadcasted' to all heat engines, even if they are
> not running any workers for the given stack, though, in such cases, it
> will be a just no-op for the heat-engine (the message will be gracefully
> discarded).

Oh hah, I just sent (b) as an option to avoid (a) without really
thinking about (b) again.

I don't think the cancel broadcasts are all that much of a drawback. I
do think you need to rate limit cancels though, or you give users the
chance to DDoS the system.
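
(A minimal per-stack cooldown, purely as an illustration of the kind of
rate limiting being suggested; the numbers and names are made up:)

    import time

    CANCEL_COOLDOWN = 60   # seconds between accepted cancels per stack
    _last_cancel = {}      # stack_id -> time of the last accepted cancel

    def allow_cancel(stack_id):
        now = time.time()
        if now - _last_cancel.get(stack_id, 0) < CANCEL_COOLDOWN:
            return False   # too soon after the previous cancel; ignore it
        _last_cancel[stack_id] = now
        return True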



Re: [openstack-dev] [heat] convergence cancel messages

2016-02-24 Thread Clint Byrum
Excerpts from Anant Patil's message of 2016-02-24 00:56:34 -0800:
> On 24-Feb-16 13:12, Clint Byrum wrote:
> > Excerpts from Anant Patil's message of 2016-02-23 23:08:31 -0800:
> >> Hi,
> >>
> >> I would like the discuss various approaches towards fixing bug
> >> https://launchpad.net/bugs/1533176
> >>
> >> When convergence is on, and if the stack is stuck, there is no way to
> >> cancel the existing request. This feature was not implemented in
> >> convergence, as the user can again issue an update on an in-progress
> >> stack. But if a resource worker is stuck, the new update will wait
> >> for-ever on it and the update will not be effective.
> >>
> >> The solution is to implement cancel request. Since the work for a stack
> >> is distributed among heat engines, the cancel request will not work as
> >> it does in legacy way. Many or all of the heat engines might be running
> >> worker threads to provision a stack.
> >>
> >> I could think of two options which I would like to discuss:
> >>
> >> (a) When a user triggered cancel request is received, set the stack
> >> current traversal to None or something else other than current
> >> traversal. With this the new check-resources/workers will never be
> >> triggered. This is okay as long as the worker(s) is not stuck. The
> >> existing workers will finish running, and no new check-resource
> >> (workers) will be triggered, and it will be a graceful cancel.  But the
> >> workers that are stuck will be stuck for-ever till stack times-out.  To
> >> take care of such cases, we will have to implement logic of "polling"
> >> the DB at regular intervals (may be at each step() of scheduler task)
> >> and bail out if the current traversal is updated. Basically, each worker
> >> will "poll" the DB to see if the current traversal is still valid and if
> >> not, stop itself. The drawback of this approach is that all the workers
> >> will be hitting the DB and incur a significant overhead.  Besides, all
> >> the stack workers irrespective of whether they will be cancelled or not,
> >> will keep on hitting DB. The advantage is that it probably is easier to
> >> implement. Also, if the worker is stuck in particular "step", then this
> >> approach will not work.
> >>
> > 
> > I think this is the simplest option. And if the polling gets to be too
> > much, you can implement an observer pattern where one worker is just
> > assigned to poll the traversal and if it changes, RPC to the known
> > active workers that they should cancel any jobs using a now-cancelled
> > stack version.
> > 
> > 
> 
> Hi Clint,
> 
> I see that observer pattern is simple, but IMO it too is not efficient.
> To implement it, we will have to note down in DB the worker to engine-id
> relationship for all the workers, and then go through all of them and
> send targeted cancel messages. This will also need us to have thread
> group manager in each engine so that it can stop the thread group
> running workers for the stack.
> 

You have to have that thread group manager anyway, or you can't ever
cancel anything in progress. That same thread group manager could also
be managing timeouts.

Apologies for my lack of understanding of where the implementation
has gone, I thought you would already have that mapping in the DB. If
that's a problem though, for this case you can have a notification
channel for cancellations, and have the management thread listen to
that, with its own local awareness of what is being worked on.
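
(A sketch of that idea: every engine listens on a shared cancellation
channel and uses its local knowledge of what it is working on to decide
whether to act. Class and method names here are hypothetical:)

    class CancelListener(object):
        """Hypothetical per-engine listener on a shared 'cancel' topic."""

        def __init__(self, thread_group_mgr):
            # thread_group_mgr tracks which stacks this engine currently
            # has worker threads for.
            self.thread_group_mgr = thread_group_mgr

        def on_cancel(self, context, stack_id):
            # Every engine receives the message; only the ones actually
            # working on the stack do anything, the rest no-op.
            if self.thread_group_mgr.has_threads(stack_id):
                self.thread_group_mgr.stop(stack_id)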



Re: [openstack-dev] [heat] convergence cancel messages

2016-02-24 Thread Anant Patil
On 24-Feb-16 14:26, Anant Patil wrote:
> On 24-Feb-16 13:12, Clint Byrum wrote:
>> Excerpts from Anant Patil's message of 2016-02-23 23:08:31 -0800:
>>> Hi,
>>>
>>> I would like the discuss various approaches towards fixing bug
>>> https://launchpad.net/bugs/1533176
>>>
>>> When convergence is on, and if the stack is stuck, there is no way to
>>> cancel the existing request. This feature was not implemented in
>>> convergence, as the user can again issue an update on an in-progress
>>> stack. But if a resource worker is stuck, the new update will wait
>>> for-ever on it and the update will not be effective.
>>>
>>> The solution is to implement cancel request. Since the work for a stack
>>> is distributed among heat engines, the cancel request will not work as
>>> it does in legacy way. Many or all of the heat engines might be running
>>> worker threads to provision a stack.
>>>
>>> I could think of two options which I would like to discuss:
>>>
>>> (a) When a user triggered cancel request is received, set the stack
>>> current traversal to None or something else other than current
>>> traversal. With this the new check-resources/workers will never be
>>> triggered. This is okay as long as the worker(s) is not stuck. The
>>> existing workers will finish running, and no new check-resource
>>> (workers) will be triggered, and it will be a graceful cancel.  But the
>>> workers that are stuck will be stuck for-ever till stack times-out.  To
>>> take care of such cases, we will have to implement logic of "polling"
>>> the DB at regular intervals (may be at each step() of scheduler task)
>>> and bail out if the current traversal is updated. Basically, each worker
>>> will "poll" the DB to see if the current traversal is still valid and if
>>> not, stop itself. The drawback of this approach is that all the workers
>>> will be hitting the DB and incur a significant overhead.  Besides, all
>>> the stack workers irrespective of whether they will be cancelled or not,
>>> will keep on hitting DB. The advantage is that it probably is easier to
>>> implement. Also, if the worker is stuck in particular "step", then this
>>> approach will not work.
>>>
>>
>> I think this is the simplest option. And if the polling gets to be too
>> much, you can implement an observer pattern where one worker is just
>> assigned to poll the traversal and if it changes, RPC to the known
>> active workers that they should cancel any jobs using a now-cancelled
>> stack version.
>>
> 
> Hi Clint,
> 
> I see that observer pattern is simple, but IMO it too is not efficient.
> To implement it, we will have to note down in DB the worker to engine-id
> relationship for all the workers, and then go through all of them and
> send targeted cancel messages. This will also need us to have thread
> group manager in each engine so that it can stop the thread group
> running workers for the stack.
> 
> Please help me understand if there is any particular disadvantage in
> option (b) that I am not missing.

Sorry, I meant I am missing :)

> 
> -- Anant
> 





Re: [openstack-dev] [heat] convergence cancel messages

2016-02-24 Thread Anant Patil
On 24-Feb-16 13:12, Clint Byrum wrote:
> Excerpts from Anant Patil's message of 2016-02-23 23:08:31 -0800:
>> Hi,
>>
>> I would like the discuss various approaches towards fixing bug
>> https://launchpad.net/bugs/1533176
>>
>> When convergence is on, and if the stack is stuck, there is no way to
>> cancel the existing request. This feature was not implemented in
>> convergence, as the user can again issue an update on an in-progress
>> stack. But if a resource worker is stuck, the new update will wait
>> for-ever on it and the update will not be effective.
>>
>> The solution is to implement cancel request. Since the work for a stack
>> is distributed among heat engines, the cancel request will not work as
>> it does in legacy way. Many or all of the heat engines might be running
>> worker threads to provision a stack.
>>
>> I could think of two options which I would like to discuss:
>>
>> (a) When a user triggered cancel request is received, set the stack
>> current traversal to None or something else other than current
>> traversal. With this the new check-resources/workers will never be
>> triggered. This is okay as long as the worker(s) is not stuck. The
>> existing workers will finish running, and no new check-resource
>> (workers) will be triggered, and it will be a graceful cancel.  But the
>> workers that are stuck will be stuck for-ever till stack times-out.  To
>> take care of such cases, we will have to implement logic of "polling"
>> the DB at regular intervals (may be at each step() of scheduler task)
>> and bail out if the current traversal is updated. Basically, each worker
>> will "poll" the DB to see if the current traversal is still valid and if
>> not, stop itself. The drawback of this approach is that all the workers
>> will be hitting the DB and incur a significant overhead.  Besides, all
>> the stack workers irrespective of whether they will be cancelled or not,
>> will keep on hitting DB. The advantage is that it probably is easier to
>> implement. Also, if the worker is stuck in particular "step", then this
>> approach will not work.
>>
> 
> I think this is the simplest option. And if the polling gets to be too
> much, you can implement an observer pattern where one worker is just
> assigned to poll the traversal and if it changes, RPC to the known
> active workers that they should cancel any jobs using a now-cancelled
> stack version.
> 

Hi Clint,

I see that observer pattern is simple, but IMO it too is not efficient.
To implement it, we will have to note down in DB the worker to engine-id
relationship for all the workers, and then go through all of them and
send targeted cancel messages. This will also need us to have thread
group manager in each engine so that it can stop the thread group
running workers for the stack.

Please help me understand if there is any particular disadvantage in
option (b) that I am not missing.

-- Anant



Re: [openstack-dev] [heat] convergence cancel messages

2016-02-23 Thread Clint Byrum
Excerpts from Anant Patil's message of 2016-02-23 23:08:31 -0800:
> Hi,
> 
> I would like the discuss various approaches towards fixing bug
> https://launchpad.net/bugs/1533176
> 
> When convergence is on, and if the stack is stuck, there is no way to
> cancel the existing request. This feature was not implemented in
> convergence, as the user can again issue an update on an in-progress
> stack. But if a resource worker is stuck, the new update will wait
> for-ever on it and the update will not be effective.
> 
> The solution is to implement cancel request. Since the work for a stack
> is distributed among heat engines, the cancel request will not work as
> it does in legacy way. Many or all of the heat engines might be running
> worker threads to provision a stack.
> 
> I could think of two options which I would like to discuss:
> 
> (a) When a user triggered cancel request is received, set the stack
> current traversal to None or something else other than current
> traversal. With this the new check-resources/workers will never be
> triggered. This is okay as long as the worker(s) is not stuck. The
> existing workers will finish running, and no new check-resource
> (workers) will be triggered, and it will be a graceful cancel.  But the
> workers that are stuck will be stuck for-ever till stack times-out.  To
> take care of such cases, we will have to implement logic of "polling"
> the DB at regular intervals (may be at each step() of scheduler task)
> and bail out if the current traversal is updated. Basically, each worker
> will "poll" the DB to see if the current traversal is still valid and if
> not, stop itself. The drawback of this approach is that all the workers
> will be hitting the DB and incur a significant overhead.  Besides, all
> the stack workers irrespective of whether they will be cancelled or not,
> will keep on hitting DB. The advantage is that it probably is easier to
> implement. Also, if the worker is stuck in particular "step", then this
> approach will not work.
> 

I think this is the simplest option. And if the polling gets to be too
much, you can implement an observer pattern where one worker is just
assigned to poll the traversal and if it changes, RPC to the known
active workers that they should cancel any jobs using a now-cancelled
stack version.



[openstack-dev] [heat] convergence cancel messages

2016-02-23 Thread Anant Patil
Hi,

I would like to discuss various approaches towards fixing bug
https://launchpad.net/bugs/1533176

When convergence is on, and if the stack is stuck, there is no way to
cancel the existing request. This feature was not implemented in
convergence, as the user can again issue an update on an in-progress
stack. But if a resource worker is stuck, the new update will wait
for-ever on it and the update will not be effective.

The solution is to implement a cancel request. Since the work for a stack
is distributed among heat engines, the cancel request will not work as
it does in the legacy way. Many or all of the heat engines might be running
worker threads to provision a stack.

I could think of two options which I would like to discuss:

(a) When a user-triggered cancel request is received, set the stack's
current traversal to None or something else other than the current
traversal. With this, new check-resources/workers will never be
triggered. This is okay as long as the worker(s) is not stuck. The
existing workers will finish running, no new check-resource
(workers) will be triggered, and it will be a graceful cancel.  But the
workers that are stuck will remain stuck for ever, till the stack times out.  To
take care of such cases, we will have to implement the logic of "polling"
the DB at regular intervals (maybe at each step() of the scheduler task)
and bail out if the current traversal is updated. Basically, each worker
will "poll" the DB to see if the current traversal is still valid and if
not, stop itself. The drawback of this approach is that all the workers
will be hitting the DB and incurring a significant overhead.  Besides, all
the stack workers, irrespective of whether they will be cancelled or not,
will keep on hitting the DB. The advantage is that it is probably easier to
implement. Also, if the worker is stuck in a particular "step", then this
approach will not work.

(b) Another approach is to send a cancel message to all the heat engines
when one receives a stack cancel request. The idea is to use the thread
group manager in each engine to keep track of the threads running for a
stack, and stop the thread group when a cancel message is received. The
advantage is that the messages to cancel stack workers are sent only when
required and there is no other overhead. The drawback is that the
cancel message is 'broadcast' to all heat engines, even if they are
not running any workers for the given stack, though in such cases it
will just be a no-op for the heat-engine (the message will be gracefully
discarded).

Implementation for option (b) is for review:
https://review.openstack.org/#/c/279406/

I am seeking your input on these approaches. Please share any other
ideas if you have.

-- Anant
