Re: [openstack-dev] [heat][tripleo] User Initiated Rollback

2015-12-03 Thread Steven Hardy
On Thu, Dec 03, 2015 at 08:11:41AM -0500, Dan Prince wrote:
> On Wed, 2015-12-02 at 16:02 +, Steven Hardy wrote:
> > So, chatting with Giulio today about
> > https://bugs.launchpad.net/heat/+bug/1521944
> > has me thinking about $subject.
> > 
> > The root cause of that issue is essentially a corner case of a stack-update,
> > combined with some coupling within the Neutron API which prevents the
> > update traversal from working.
> > 
> > But it raises the broader question of what a "rollback" actually is,
> > and
> > how a user can potentially use it to get out of the kind of mess
> > described
> > in that bug (where, otherwise, your only option is to delete the
> > entire
> > stack).
> > 
> > Currently, we treat rollback as a special type of update, where, if
> > an
> > in-progress update fails, we then try to update again, to the
> > previous
> > stack definition[1], but as Giulio has discovered, there are times
> > when
> > that doesn't work, because what you actually want is to recover the
> > existing resource from the backup stack, not create a new one with
> > the same
> > properties.
> 
> Is there more information about this case (a bug perhaps)? Presumably
> it is an OpenStack resource you are talking about here... like a Nova
> Server or Neutron Network Port?

Well the bug is linked above (1521944), but there's no bug specific to
rollback.

As Zane has pointed out, heat is actually working as desired here, because
we aren't able to differentiate an attempt to delete a neutron port which
results in "not allowed, in use" from one that results in "500, I am broken".

I was hoping there was some way to make this easier via rollback, but
increasingly it seems the solution is not to tell Heat to do the wrong
thing (which is the root cause of this issue).

There are a few ways we can do that:

1. Stop defining default "noop" resources in
overcloud-resource-registry-puppet.yaml - it makes it too easy to
accidentally switch to a noop (destructive) implementation on update.

2. Improve heat stack update preview, so it handles nested stacks, then we
can easily have a pre-update validation step, which for example checks (and
warns, loudly) if any resources will be deleted (particularly network and
server resources..)  I'm working on this ref:

https://bugs.launchpad.net/heat/+bug/1521971

3. Implement a template annotation which allows you to say "don't update"
for certain resources, such as servers and network ports etc.  Rabi is
working on this, here's the (old) BP which didn't get implemented but I
think will help us:

https://github.com/openstack/heat-specs/blob/master/specs/kilo/stack-update-restrict.rst

> > Then, looking at convergence, we have a different definition of
> > rollback,
> > it's not yet clear to me how this should behave in a similar
> > scenario, e.g
> > when the resource we want to roll back to failed to get deleted but
> > still
> > exists (so, the resource is FAILED, but the underlying resource is
> > fine)?
> > 
> > Finally, the interface to rollback - atm you have to know before
> > something
> > fails that you'd like to enable rollback for a specific update.  This
> > seems
> > suboptimal, since invariably by the time you know you need rollback,
> > it's
> > too late.  Can we enable a user-initiated rollback from a FAILED
> > state, via
> > one of:
> > 
> >  - Introduce a new heat API that allows an explicit heat stack-rollback?
> >  - (ab)use PATCH to trigger rollback on heat stack-update -x --rollback=True?
> > 
> > The former approach fits better with the current stack.Stack
> > implementation, because the ROLLBACK stack state already exists.  The
> > latter has the advantage that it doesn't need a new API so might be
> > backportable.
> > 
> > Any thoughts on how we might proceed to make this situation better,
> > and
> > enable folks to roll back in the least destructive way possible when
> > they
> > end up in a FAILED state?
> 
> From a TripleO standpoint I would really like to end up in a place
> where we aren't thinking of Heat as a rollback tool and more of a make
> it so tool. I think there might be a small case for the
> "infrastructure" side where Heat is creating OpenStack objects for us
> (servers and ports). We'd like not to destroy/replace these when we
> update the "infrastructure" pieces of our stack and if things go badly
> on an update you just want to stay in the (hopefully still working)
> previous state.

Yeah, keeping the infrastructure and software configuration more cleanly
separated will help, but we still need much better pre-update validation.

> On the configuration (currently software deployments driving puppet) I
> would very much like to have Heat be a make-it so tool that does what
> we tell it. If I wanted to roll back the configuration I would prefer
> to simply do another heat stack-update with the previous
> parameters/manifests/etc. Or perhaps more drastically, delete the
> entire configuration stack and heat stack-create with the previous 

Re: [openstack-dev] [heat][tripleo] User Initiated Rollback

2015-12-03 Thread Dan Prince
On Wed, 2015-12-02 at 16:02 +, Steven Hardy wrote:
> So, chatting with Giulio today about
> https://bugs.launchpad.net/heat/+bug/1521944
> has me thinking about $subject.
> 
> The root cause of that issue is essentially a corner case of a stack-update,
> combined with some coupling within the Neutron API which prevents the
> update traversal from working.
> 
> But it raises the broader question of what a "rollback" actually is,
> and
> how a user can potentially use it to get out of the kind of mess
> described
> in that bug (where, otherwise, your only option is to delete the
> entire
> stack).
> 
> Currently, we treat rollback as a special type of update, where, if
> an
> in-progress update fails, we then try to update again, to the
> previous
> stack definition[1], but as Giulio has discovered, there are times
> when
> that doesn't work, because what you actually want is to recover the
> existing resource from the backup stack, not create a new one with
> the same
> properties.

Is there more information about this case (a bug perhaps)? Presumably
it is an OpenStack resource you are talking about here... like a Nova
Server or Neutron Network Port?


> 
> Then, looking at convergence, we have a different definition of
> rollback,
> it's not yet clear to me how this should behave in a similar
> scenario, e.g
> when the resource we want to roll back to failed to get deleted but
> still
> exists (so, the resource is FAILED, but the underlying resource is
> fine)?
> 
> Finally, the interface to rollback - atm you have to know before
> something
> fails that you'd like to enable rollback for a specific update.  This
> seems
> suboptimal, since invariably by the time you know you need rollback,
> it's
> too late.  Can we enable a user-initiated rollback from a FAILED
> state, via
> one of:
> 
>  - Introduce a new heat API that allows an explicit heat stack-rollback?
>  - (ab)use PATCH to trigger rollback on heat stack-update -x --rollback=True?
> 
> The former approach fits better with the current stack.Stack
> implementation, because the ROLLBACK stack state already exists.  The
> latter has the advantage that it doesn't need a new API so might be
> backportable.
> 
> Any thoughts on how we might proceed to make this situation better,
> and
> enable folks to roll back in the least destructive way possible when
> they
> end up in a FAILED state?

From a TripleO standpoint I would really like to end up in a place
where we aren't thinking of Heat as a rollback tool and more of a make
it so tool. I think there might be a small case for the
"infrastructure" side where Heat is creating OpenStack objects for us
(servers and ports). We'd like not to destroy/replace these when we
update the "infrastructure" pieces of our stack and if things go badly
on an update you just want to stay in the (hopefully still working)
previous state.

On the configuration (currently software deployments driving puppet) I
would very much like to have Heat be a make-it so tool that does what
we tell it. If I wanted to roll back the configuration I would prefer
to simply do another heat stack-update with the previous
parameters/manifests/etc. Or perhaps more drastically, delete the
entire configuration stack and heat stack-create with the previous one.
Puppet is meant to be idempotent, so re-running a previously working
manifest might be just what you want. This wouldn't cover all cases
for rollback... and there are certainly things where you'd want a
custom ad-hoc puppet snippet or bash script to run before you did a
follow-up heat stack-update to put things back like they were. For
these cases I think workflow tools that help drive our Heat
configuration orchestration could work well.

Dan

> 
> Steve
> 
> [1] https://github.com/openstack/heat/blob/master/heat/engine/stack.py#L1331
> [2] https://github.com/openstack/heat/blob/master/heat/engine/stack.py#L1143

__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [heat][tripleo] User Initiated Rollback

2015-12-03 Thread Steve Baker

On 04/12/15 03:41, Steven Hardy wrote:

On Thu, Dec 03, 2015 at 08:11:41AM -0500, Dan Prince wrote:

On Wed, 2015-12-02 at 16:02 +, Steven Hardy wrote:

So, chatting with Giulio today about
https://bugs.launchpad.net/heat/+bug/1521944
has me thinking about $subject.

The root cause of that issue is essentially a corner case of a stack-update,
combined with some coupling within the Neutron API which prevents the
update traversal from working.

But it raises the broader question of what a "rollback" actually is,
and
how a user can potentially use it to get out of the kind of mess
described
in that bug (where, otherwise, your only option is to delete the
entire
stack).

Currently, we treat rollback as a special type of update, where, if
an
in-progress update fails, we then try to update again, to the
previous
stack definition[1], but as Giulio has discovered, there are times
when
that doesn't work, because what you actually want is to recover the
existing resource from the backup stack, not create a new one with
the same
properties.

Is there more information about this case (a bug perhaps)? Presumably
it is an OpenStack resource you are talking about here... like a Nova
Server or Neutron Network Port?

Well the bug is linked above (1521944), but there's no bug specific to
rollback.

As Zane has pointed out, heat is actually working as desired here, because
we aren't able to differentiate an attempt to delete a neutron port which
results in "not allowed, in use" from one that results in "500, I am broken".

I was hoping there was some way to make this easier via rollback, but
increasingly it seems the solution is not to tell Heat to do the wrong
thing (which is the root cause of this issue).

There are a few ways we can do that:

1. Stop defining default "noop" resources in
overcloud-resource-registry-puppet.yaml - it makes it too easy to
accidentally switch to a noop (destructive) implementation on update.

Splitting out the noop stubs into their own environment that only gets
included on overcloud create would certainly lower the risk of
customizations being overwritten by stubs. We would just need a strategy
for when new types are added that need to be stubbed by default.

2. Improve heat stack update preview, so it handles nested stacks, then we
can easily have a pre-update validation step, which for example checks (and
warns, loudly) if any resources will be deleted (particularly network and
server resources..)  I'm working on this ref:

https://bugs.launchpad.net/heat/+bug/1521971

We should definitely do this once pre-update works for nested stacks.
tripleoclient could have a whitelist of resource types which generally
shouldn't be replaced (subnets, ports, servers) and prompt the user with
a list of resources which will be replaced and a N/y question to continue.
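
A minimal sketch of the check described above, assuming the update preview is available as a dict keyed by action ("replaced", "deleted", ...) whose entries carry "resource_name" and "resource_type". That shape, the PROTECTED_TYPES list and both function names are assumptions for illustration, not tripleoclient code:

```python
# Hypothetical list of types that an update generally should not replace.
PROTECTED_TYPES = {"OS::Neutron::Subnet", "OS::Neutron::Port", "OS::Nova::Server"}


def risky_changes(preview, protected_types=PROTECTED_TYPES):
    """Return (action, name, type) for protected resources the update would
    replace or delete. The preview layout is an assumed one."""
    risky = []
    for action in ("replaced", "deleted"):
        for res in preview.get(action, []):
            if res["resource_type"] in protected_types:
                risky.append((action, res["resource_name"], res["resource_type"]))
    return risky


def confirmed(answer):
    """Interpret the reply to an N/y prompt; anything but y/yes means no."""
    return answer.strip().lower() in ("y", "yes")
```

The caller would print the result of risky_changes() and only go on to the real update if confirmed(input("Continue? [N/y] ")) is true, so that just pressing Enter aborts.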



3. Implement a template annotation which allows you to say "don't update"
for certain resources, such as servers and network ports etc.  Rabi is
working on this, here's the (old) BP which didn't get implemented but I
think will help us:

https://github.com/openstack/heat-specs/blob/master/specs/kilo/stack-update-restrict.rst

Yes, a way of declaring a resource as not replaceable would also
increase safety (in-place updates should be fine though)



Then, looking at convergence, we have a different definition of
rollback,
it's not yet clear to me how this should behave in a similar
scenario, e.g
when the resource we want to roll back to failed to get deleted but
still
exists (so, the resource is FAILED, but the underlying resource is
fine)?

Finally, the interface to rollback - atm you have to know before
something
fails that you'd like to enable rollback for a specific update.  This
seems
suboptimal, since invariably by the time you know you need rollback,
it's
too late.  Can we enable a user-initiated rollback from a FAILED
state, via
one of:

  - Introduce a new heat API that allows an explicit heat stack-rollback?
  - (ab)use PATCH to trigger rollback on heat stack-update -x --rollback=True?

The former approach fits better with the current stack.Stack
implementation, because the ROLLBACK stack state already exists.  The
latter has the advantage that it doesn't need a new API so might be
backportable.

Any thoughts on how we might proceed to make this situation better,
and
enable folks to roll back in the least destructive way possible when
they
end up in a FAILED state?

 From a TripleO standpoint I would really like to end up in a place
where we aren't thinking of Heat as a rollback tool and more of a make
it so tool. I think there might be a small case for the
"infrastructure" side where Heat is creating OpenStack objects for us
(servers and ports). We'd like not to destroy/replace these when we
update the "infrastructure" pieces of our stack and if things go badly
on an update you just want to stay in the (hopefully still working)
previous state.

Yeah, keeping the infrastructure and software configuration more cleanly
separated will help, 

Re: [openstack-dev] [heat][tripleo] User Initiated Rollback

2015-12-03 Thread Clint Byrum
Zane I want to echo your sentiments exactly below. I agree with all of
the things basically.

The only thing I'd add is that no matter how good you make Heat's rollback
API, it will never be as good as git. So I would suggest that you just
have people roll forward from Heat's perspective, and let VCS systems
handle history tracking. What might help with that would be maybe some
key/value pairs that would allow people to set something like this:

remote=https://github.com/myorg/mytemplates gitref=tags/1.2.9

So that anything interrogating Heat can know whether the latest reference
is applied.
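
A rough sketch of how a tool interrogating Heat might use such pairs, assuming they come back as one whitespace-separated "key=value" string like the line above. Where Heat would store them is left open, and parse_pairs and is_applied are hypothetical names:

```python
def parse_pairs(text):
    """Parse 'key=value key=value' into a dict (assumed storage format)."""
    pairs = {}
    for item in text.split():
        key, sep, value = item.partition("=")
        if not sep:
            raise ValueError(f"expected key=value, got {item!r}")
        pairs[key] = value
    return pairs


def is_applied(stack_pairs, remote, gitref):
    """True if the stack records exactly this repository and git reference."""
    return (stack_pairs.get("remote") == remote
            and stack_pairs.get("gitref") == gitref)
```

With this, "rolling back" is just another roll forward to an older gitref, and the history itself stays in git.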

Excerpts from Zane Bitter's message of 2015-12-02 08:51:55 -0800:
> On 02/12/15 11:02, Steven Hardy wrote:
> > So, chatting with Giulio today about 
> > https://bugs.launchpad.net/heat/+bug/1521944
> > has me thinking about $subject.
> >
> > The root cause of that issue is essentially a corner case of a stack-update,
> > combined with some coupling within the Neutron API which prevents the
> > update traversal from working.
> >
> > But it raises the broader question of what a "rollback" actually is, and
> > how a user can potentially use it to get out of the kind of mess described
> > in that bug (where, otherwise, your only option is to delete the entire
> > stack).
> 
> I'm not sure it does raise that question; the same issue crops up 
> whether you try to roll back or roll forward.
> 
> > Currently, we treat rollback as a special type of update, where, if an
> > in-progress update fails, we then try to update again, to the previous
> > stack definition[1], but as Giulio has discovered, there are times when
> > that doesn't work, because what you actually want is to recover the
> > existing resource from the backup stack, not create a new one with the same
> > properties.
> 
> The rollback flow isn't the problem here. The problem is that the 
> resource is marked as DELETE_FAILED, and Heat has no mechanism in 
> general for knowing if that means it's still good and we can restore it 
> or if it is, as we say in New Zealand, completely munted[1].
> 
> Since Heat can't know, it assumes the latter and replaces the resource. 
> If we wanted to fix this, we'd need a mechanism to verify the health of 
> the resource - and obviously it would have to be resource-specific. We 
> already have an interface for that kind of mechanism in the form of 
> handle_check(), so there's a chance we could repurpose that to do this.
> 
> [1] http://dictionary.reference.com/browse/munted?s=t
> 
> > Then, looking at convergence, we have a different definition of rollback,
> > it's not yet clear to me how this should behave in a similar scenario, e.g
> > when the resource we want to roll back to failed to get deleted but still
> > exists (so, the resource is FAILED, but the underlying resource is fine)?
> 
> It's essentially the same. Convergence behaves a bit better when 
> multiple failed versions of the same resource start stacking up, but it 
> won't solve the problem.
> 
> > Finally, the interface to rollback - atm you have to know before something
> > fails that you'd like to enable rollback for a specific update.  This seems
> > suboptimal, since invariably by the time you know you need rollback, it's
> > too late.  Can we enable a user-initiated rollback from a FAILED state, via
> > one of:
> >
> >   - Introduce a new heat API that allows an explicit heat stack-rollback?
> >   - (ab)use PATCH to trigger rollback on heat stack-update -x 
> > --rollback=True?
> 
> In convergence there's no distinction between a rollback and an update 
> using the previous template, so IMHO there's not much need for a 
> separate API.
> 
> > The former approach fits better with the current stack.Stack
> > implementation, because the ROLLBACK stack state already exists.  The
> > latter has the advantage that it doesn't need a new API so might be
> > backportable.
> 
> Convergence does store a copy of the previous template (not 100% sure 
> when it deletes it at the moment - I suspect after the update succeeds), 
> so a rollback API would be feasible if we decided we needed it. I'd 
> prefer the first approach if so.
> 
> > Any thoughts on how we might proceed to make this situation better, and
> > enable folks to roll back in the least destructive way possible when they
> > end up in a FAILED state?
> 
> Note that the root cause of this problem is that Heat doesn't have a 
> global view of dependencies across stacks - if it did it would never 
> have tried to delete the subnet with ports still in it. For the benefit 
> of those who weren't at the design summit, we discussed potential fixes 
> there:
> 
> https://etherpad.openstack.org/p/mitaka-heat-break-stack-barrier
> 
> cheers,
> Zane.
> 
> > Steve
> >
> > [1] https://github.com/openstack/heat/blob/master/heat/engine/stack.py#L1331
> > [2] https://github.com/openstack/heat/blob/master/heat/engine/stack.py#L1143
> >

Re: [openstack-dev] [heat][tripleo] User Initiated Rollback

2015-12-03 Thread Zane Bitter

On 03/12/15 09:41, Steven Hardy wrote:

On Thu, Dec 03, 2015 at 08:11:41AM -0500, Dan Prince wrote:

On Wed, 2015-12-02 at 16:02 +, Steven Hardy wrote:

So, chatting with Giulio today about
https://bugs.launchpad.net/heat/+bug/1521944
has me thinking about $subject.

The root cause of that issue is essentially a corner case of a stack-update,
combined with some coupling within the Neutron API which prevents the
update traversal from working.

But it raises the broader question of what a "rollback" actually is,
and
how a user can potentially use it to get out of the kind of mess
described
in that bug (where, otherwise, your only option is to delete the
entire
stack).

Currently, we treat rollback as a special type of update, where, if
an
in-progress update fails, we then try to update again, to the
previous
stack definition[1], but as Giulio has discovered, there are times
when
that doesn't work, because what you actually want is to recover the
existing resource from the backup stack, not create a new one with
the same
properties.


Is there more information about this case (a bug perhaps)? Presumably
it is an OpenStack resource you are talking about here... like a Nova
Server or Neutron Network Port?


Well the bug is linked above (1521944), but there's no bug specific to
rollback.

As Zane has pointed out, heat is actually working as desired here, because
we aren't able to differentiate an attempt to delete a neutron port which
results in "not allowed, in use" from one that results in "500, I am broken".


I wouldn't say that we're not able so much as we just don't. Changing 
that *in general* is very hard (you have to enumerate every way any type of
resource could fail and figure out how to handle each one). But if this 
particular problem is causing us pain then it's extremely amenable to a 
custom local solution. (I wouldn't even call this a hack; resource 
plugins are explicitly allowed to customise their update logic.)


I added a comment on the bug to that effect. I would insert this as 
option 0 in your list below.
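
To make the idea of a local solution concrete, here is a sketch of the decision only, not of a real resource plugin. It assumes the "in use" refusal arrives as HTTP 409 and that anything else means the state of the port is unknown; the status code mapping and all names below are illustrative assumptions:

```python
from enum import Enum


class DeleteOutcome(Enum):
    STILL_INTACT = "still_intact"  # delete was refused, resource untouched
    UNKNOWN = "unknown"            # cannot tell what state the resource is in


def classify_delete_failure(status_code):
    """Classify a failed delete by HTTP status.

    Assumes 409 (Conflict) is how "not allowed, in use" is reported, and
    treats every other failure, including 500, as unknown.
    """
    if status_code == 409:
        return DeleteOutcome.STILL_INTACT
    return DeleteOutcome.UNKNOWN
```

A plugin that got STILL_INTACT back could keep the existing resource on the next update rather than replacing it.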



I was hoping there was some way to make this easier via rollback, but
increasingly it seems the solution is not to tell Heat to do the wrong
thing (which is the root cause of this issue).

There are a few ways we can do that:

1. Stop defining default "noop" resources in
overcloud-resource-registry-puppet.yaml - it makes it too easy to
accidentally switch to a noop (destructive) implementation on update.

2. Improve heat stack update preview, so it handles nested stacks, then we
can easily have a pre-update validation step, which for example checks (and
warns, loudly) if any resources will be deleted (particularly network and
server resources..)  I'm working on this ref:

https://bugs.launchpad.net/heat/+bug/1521971

3. Implement a template annotation which allows you to say "don't update"
for certain resources, such as servers and network ports etc.  Rabi is
working on this,


I... am not aware that anyone is actively working on this at this moment.


here's the (old) BP which didn't get implemented but I
think will help us:

https://github.com/openstack/heat-specs/blob/master/specs/kilo/stack-update-restrict.rst


Hmm, apparently I missed the boat on this spec review.

I think that putting this stuff in the template is a mistake. I don't 
see update restrictions as a way of declaring stuff that will never 
change (how could you possibly know this in advance when, as a former 
colleague once memorably put it, "it's not the future yet"?). Rather, it 
should be a way of putting either a user or a validation tool into the loop:


stack update-preview -> approve changes -> stack update

In the most common use case, you'd just be passing the output of 
update-preview directly into update, to restrict Heat to making only the 
changes you approved. IMHO this argues strongly for this data to be 
passed in alongside the template, not edited into it.
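
As a sketch of what "only the changes you approved" could mean, assume both the approved set and the changes Heat is about to make are lists of (resource name, action) pairs. Neither that representation nor the function name comes from Heat:

```python
def unapproved_changes(approved, proposed):
    """Return the proposed (resource, action) pairs that were never approved.

    An update restricted in this way would refuse to start unless the
    returned list is empty.
    """
    return sorted(set(proposed) - set(approved))
```

The approved list travels with the update request, next to the template, so the template itself stays a plain declaration of the desired state.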


Also, putting restrictions like this into the template breaks the 
declarative model. Say you have a Git history of your deployment... you 
should be able to move from any point in that history to any other point 
with a single stack update. If there are times when that would be a bad 
idea and you want a tool to stop you from even trying in those 
situations, that's great. But if you have to step through every 
intermediate revision in order to be able to guarantee a transformation 
from one revision to another, then what you have is no longer a 
declarative system in any meaningful sense.


To give a concrete example:

Template A: You create some resources
Template B: You realise that some of them need to change
Template C: You decide that those ones are now fixed for all time, so 
you disallow further updates of them


A user who has deployed template version A can no longer update to the 
latest version, C, without first updating to B.


i.e. we're mixing declarations that are absolute with stuff that is 
relative to the last revision.
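
The A/B/C example can be played through in a few lines. This is a toy model only: templates are dicts with "resources" and a hypothetical "locked" set standing in for the in-template restriction, and the property values are made up:

```python
def update_allowed(current, target):
    """An update is refused if the target template locks a resource whose
    definition differs from the currently deployed one."""
    for name in target["locked"]:
        old = current["resources"].get(name)
        new = target["resources"].get(name)
        if old is not None and old != new:
            return False
    return True


# Template A: create a resource; B: change it; C: same as B, but locked.
A = {"resources": {"net": {"cidr": "10.0.0.0/24"}}, "locked": set()}
B = {"resources": {"net": {"cidr": "10.0.1.0/24"}}, "locked": set()}
C = {"resources": {"net": {"cidr": "10.0.1.0/24"}}, "locked": {"net"}}
```

Here update_allowed(A, B) and update_allowed(B, C) both hold, but update_allowed(A, C) does not, which is exactly the loss of "any revision to any revision" described above.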


I'll paste the 

[openstack-dev] [heat][tripleo] User Initiated Rollback

2015-12-02 Thread Steven Hardy
So, chatting with Giulio today about 
https://bugs.launchpad.net/heat/+bug/1521944
has me thinking about $subject.

The root cause of that issue is essentially a corner case of a stack-update,
combined with some coupling within the Neutron API which prevents the
update traversal from working.

But it raises the broader question of what a "rollback" actually is, and
how a user can potentially use it to get out of the kind of mess described
in that bug (where, otherwise, your only option is to delete the entire
stack).

Currently, we treat rollback as a special type of update, where, if an
in-progress update fails, we then try to update again, to the previous
stack definition[1], but as Giulio has discovered, there are times when
that doesn't work, because what you actually want is to recover the
existing resource from the backup stack, not create a new one with the same
properties.

Then, looking at convergence, we have a different definition of rollback,
it's not yet clear to me how this should behave in a similar scenario, e.g
when the resource we want to roll back to failed to get deleted but still
exists (so, the resource is FAILED, but the underlying resource is fine)?

Finally, the interface to rollback - atm you have to know before something
fails that you'd like to enable rollback for a specific update.  This seems
suboptimal, since invariably by the time you know you need rollback, it's
too late.  Can we enable a user-initiated rollback from a FAILED state, via
one of:

 - Introduce a new heat API that allows an explicit heat stack-rollback?
 - (ab)use PATCH to trigger rollback on heat stack-update -x --rollback=True?

The former approach fits better with the current stack.Stack
implementation, because the ROLLBACK stack state already exists.  The
latter has the advantage that it doesn't need a new API so might be
backportable.

Any thoughts on how we might proceed to make this situation better, and
enable folks to roll back in the least destructive way possible when they
end up in a FAILED state?

Steve

[1] https://github.com/openstack/heat/blob/master/heat/engine/stack.py#L1331
[2] https://github.com/openstack/heat/blob/master/heat/engine/stack.py#L1143



Re: [openstack-dev] [heat][tripleo] User Initiated Rollback

2015-12-02 Thread Zane Bitter

On 02/12/15 11:02, Steven Hardy wrote:

So, chatting with Giulio today about 
https://bugs.launchpad.net/heat/+bug/1521944
has me thinking about $subject.

The root cause of that issue is essentially a corner case of a stack-update,
combined with some coupling within the Neutron API which prevents the
update traversal from working.

But it raises the broader question of what a "rollback" actually is, and
how a user can potentially use it to get out of the kind of mess described
in that bug (where, otherwise, your only option is to delete the entire
stack).


I'm not sure it does raise that question; the same issue crops up 
whether you try to roll back or roll forward.



Currently, we treat rollback as a special type of update, where, if an
in-progress update fails, we then try to update again, to the previous
stack definition[1], but as Giulio has discovered, there are times when
that doesn't work, because what you actually want is to recover the
existing resource from the backup stack, not create a new one with the same
properties.


The rollback flow isn't the problem here. The problem is that the 
resource is marked as DELETE_FAILED, and Heat has no mechanism in 
general for knowing if that means it's still good and we can restore it 
or if it is, as we say in New Zealand, completely munted[1].


Since Heat can't know, it assumes the latter and replaces the resource. 
If we wanted to fix this, we'd need a mechanism to verify the health of 
the resource - and obviously it would have to be resource-specific. We 
already have an interface for that kind of mechanism in the form of 
handle_check(), so there's a chance we could repurpose that to do this.
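
A sketch of that decision, with the resource-specific health check passed in as a plain callable that raises when the underlying resource is bad. The function name and the returned strings are made up for illustration; in Heat the callable would be whatever sits behind handle_check() for that resource type:

```python
def rollback_action(action, status, check):
    """Decide what to do with a resource during rollback.

    Only a resource in DELETE/FAILED is in question. If its health check
    passes, the existing resource can be restored; if the check raises,
    assume the worst and replace it.
    """
    if (action, status) != ("DELETE", "FAILED"):
        return "keep"
    try:
        check()
    except Exception:
        return "replace"
    return "restore"
```

Today's behaviour corresponds to a check that always raises, i.e. every DELETE_FAILED resource gets replaced.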


[1] http://dictionary.reference.com/browse/munted?s=t


Then, looking at convergence, we have a different definition of rollback,
it's not yet clear to me how this should behave in a similar scenario, e.g
when the resource we want to roll back to failed to get deleted but still
exists (so, the resource is FAILED, but the underlying resource is fine)?


It's essentially the same. Convergence behaves a bit better when 
multiple failed versions of the same resource start stacking up, but it 
won't solve the problem.



Finally, the interface to rollback - atm you have to know before something
fails that you'd like to enable rollback for a specific update.  This seems
suboptimal, since invariably by the time you know you need rollback, it's
too late.  Can we enable a user-initiated rollback from a FAILED state, via
one of:

  - Introduce a new heat API that allows an explicit heat stack-rollback?
  - (ab)use PATCH to trigger rollback on heat stack-update -x --rollback=True?


In convergence there's no distinction between a rollback and an update 
using the previous template, so IMHO there's not much need for a 
separate API.



The former approach fits better with the current stack.Stack
implementation, because the ROLLBACK stack state already exists.  The
latter has the advantage that it doesn't need a new API so might be
backportable.


Convergence does store a copy of the previous template (not 100% sure 
when it deletes it at the moment - I suspect after the update succeeds), 
so a rollback API would be feasible if we decided we needed it. I'd 
prefer the first approach if so.
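
If rollback really is just an update using the previous template, an explicit rollback call reduces to very little. A toy model, assuming only that one previous template is kept around (StackHistory is a hypothetical class, not Heat's stack.Stack):

```python
class StackHistory:
    """Keeps the current template and the one deployed before it."""

    def __init__(self, template):
        self.current = template
        self.previous = None

    def update(self, template):
        # Every update remembers what it replaced.
        self.previous, self.current = self.current, template

    def rollback(self):
        # Rollback is an ordinary update to the stored previous template.
        if self.previous is None:
            raise ValueError("no previous template stored")
        self.update(self.previous)
```

Whether the previous template is still around after a successful update is the open question mentioned above; this sketch simply keeps it.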



Any thoughts on how we might proceed to make this situation better, and
enable folks to roll back in the least destructive way possible when they
end up in a FAILED state?


Note that the root cause of this problem is that Heat doesn't have a 
global view of dependencies across stacks - if it did it would never 
have tried to delete the subnet with ports still in it. For the benefit 
of those who weren't at the design summit, we discussed potential fixes 
there:


https://etherpad.openstack.org/p/mitaka-heat-break-stack-barrier
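
To illustrate what a global view would buy us (this is not one of the fixes proposed on the etherpad), assume Heat had a map from each resource to the resources that depend on it, across all stacks. The map and the function name are hypothetical:

```python
def blocked_deletes(to_delete, dependents):
    """Return {resource: [dependents]} for deletions that must not proceed
    because something outside the delete set still depends on them."""
    blocked = {}
    for name in to_delete:
        users = [u for u in dependents.get(name, []) if u not in to_delete]
        if users:
            blocked[name] = users
    return blocked
```

With dependents = {"subnet": ["port_a", "port_b"]}, asking to delete only "subnet" is blocked by both ports, while deleting the subnet together with its ports is not.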

cheers,
Zane.


Steve

[1] https://github.com/openstack/heat/blob/master/heat/engine/stack.py#L1331
[2] https://github.com/openstack/heat/blob/master/heat/engine/stack.py#L1143
