Re: [openstack-dev] [Heat][Summit] Input wanted - real world heat spec

2014-04-27 Thread Steve Baker
On 25/04/14 11:29, Clint Byrum wrote:
 Also by loading the whole stack we've allowed resources to bleed into
 other resources. Currently, reading the Metadata for a single item
 entails _a lot_ of queries to the database because we end up having to
 load the entire stack. We can't continue that as stacks grow in size.
As an aside, this changeset[1] results in a stack load requiring _one_
query instead of _a lot_. Clint's argument still stands though.

[1]
https://review.openstack.org/#/q/status:open+project:openstack/heat+branch:master+topic:bug/1306743,n,z

___
OpenStack-dev mailing list
OpenStack-dev@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [Heat][Summit] Input wanted - real world heat spec

2014-04-24 Thread Chris Armstrong
On April 23, 2014 at 7:47:37 PM, Robert Collins 
(robe...@robertcollins.net) wrote:
Hi, we've got this summit session planned -
http://summit.openstack.org/cfp/details/428 which is really about
https://etherpad.openstack.org/p/heat-workflow-vs-convergence

We'd love feedback and questions - this is a significant amount of
work, but work that I (and many others, based on responses so far)
believe is needed to really take Heat to users and ops teams.

Right now we're looking for both high and low level design and input.

One thing I’m curious about is whether we would gain benefit from explicitly 
managing resources as state machines. I’m not very familiar with TaskFlow, but 
my impression is that it basically knows how to run a defined workflow through 
multiple steps until completion. Heat resources will, with this change, become 
objects that need to react to inputs at any point in time, so I wonder if it’s 
better to model them as a finite state machine instead of just with workflows.

Granted, I’m pretty unfamiliar with TaskFlow, so I may be off the mark here. I 
would like to point out that a new very simple but concise FSM-modeling library 
was recently released called “Machinist”, and it may be worth taking a look at: 
https://github.com/hybridcluster/machinist
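To make the idea concrete, here is a minimal sketch (plain Python, not the Machinist API; the states and inputs are invented for illustration) of a Heat resource modeled as an FSM: a transition table maps each (state, input) pair to a successor state, so the resource can react to an input at any point in its lifetime rather than only inside a one-shot workflow:

```python
# Hypothetical transition table for a resource's lifecycle; a real model
# would be derived from the resource plugin's actual states.
TRANSITIONS = {
    ("PENDING", "create"): "CREATING",
    ("CREATING", "create_ok"): "ACTIVE",
    ("CREATING", "create_failed"): "ERROR",
    ("ERROR", "retry"): "CREATING",
    ("ACTIVE", "delete"): "DELETING",
    ("DELETING", "delete_ok"): "DELETED",
}

class ResourceFSM:
    def __init__(self, state="PENDING"):
        self.state = state

    def on_input(self, event):
        # Inputs with no defined transition from the current state are
        # rejected rather than silently ignored.
        try:
            self.state = TRANSITIONS[(self.state, event)]
        except KeyError:
            raise ValueError("no transition for %r in state %s"
                             % (event, self.state))
        return self.state
```

The appeal over a one-shot workflow is that a late-arriving event (say, a retry after an error) is just another transition, not a restart of the whole flow.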

--
Christopher Armstrong
IRC: radix




Re: [openstack-dev] [Heat][Summit] Input wanted - real world heat spec

2014-04-24 Thread Zane Bitter

On 23/04/14 20:45, Robert Collins wrote:

Hi, we've got this summit session planned -
http://summit.openstack.org/cfp/details/428 which is really about
https://etherpad.openstack.org/p/heat-workflow-vs-convergence

We'd love feedback and questions - this is a significant amount of
work, but work that I (and many others, based on responses so far)
believe is needed to really take Heat to users and ops teams.

Right now we're looking for both high and low level design and input.

If you're an operator/user/developer of/with/around heat - please take
a couple of minutes to look - feedback inline in the etherpad, or here
on the list - whatever suits you.

The basic idea is:
  - no changes needed to the heat template language etc


+1 for this part, definitely :)


  - take a holistic view and fix the system's emergent properties by
using a different baseline architecture within it
  - ???
  - profit!


Thanks for writing this up Rob. This is certainly a more ambitious scale 
of application to deploy than we ever envisioned in the early days of 
Heat ;) But I firmly believe that what is good for TripleO will be great 
for the rest of our users too. All of the observed issues mentioned are 
things we definitely want to address.


I have a few questions about the specific architecture being proposed. 
It's not clear to me what you mean by "call-stack style" in referring to 
the current paradigm. Maybe you could elaborate on how the current style 
and the convergence style differ.


Specifically, I am not clear on whether 'convergence' means:
 (a) Heat continues to respect the dependency graph but does not stop 
after one traversal, instead repeatedly processing it until (and even 
after) the stack is complete; or
 (b) Heat ignores the dependency graph and just throws everything 
against the wall, repeating until it has all stuck.
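If 'convergence' means interpretation (a), the control loop might look roughly like this sketch (the graph and state representations are hypothetical): traverse in dependency order, apply whatever changes are needed, and repeat until a full pass changes nothing:

```python
# Sketch of interpretation (a): keep respecting the dependency graph, but
# re-traverse it until every resource matches the template.
def converge(graph, observed, desired, apply_change, max_passes=10):
    """graph: {resource: [dependencies]}; traverse deps-first, repeatedly."""
    for _ in range(max_passes):
        changed = False
        for res in topo_order(graph):
            if observed[res] != desired[res]:
                observed[res] = apply_change(res, desired[res])
                changed = True
        if not changed:          # converged: a full pass made no changes
            return True
    return False                 # gave up; the stack never settled

def topo_order(graph):
    # Depth-first walk emitting dependencies before their dependents.
    seen, order = set(), []
    def visit(node):
        if node in seen:
            return
        seen.add(node)
        for dep in graph.get(node, []):
            visit(dep)
        order.append(node)
    for node in graph:
        visit(node)
    return order
```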


I also have doubts about the principle "Users should only need to 
intervene with a stack when there is no right action that Heat can take 
to deliver the current template+parameters". That sounds good in theory, 
but in practice it's very hard to know when there is a right action Heat 
can take and when there isn't. e.g. There are innumerable ways to create 
a template that can _never_ actually converge, and I don't believe 
there's a general way we can detect that, only the hard way: one error 
type at a time, for every single resource type. Offering users a way to 
control how and when that happens allows them to make the best decisions 
for their particular circumstances - and hopefully a future WFaaS like 
Mistral will make it easy to set up continuous monitoring for those who 
require it. (Not incidentally, it also gives cloud operators an 
opportunity to charge their users in proportion to their actual 
requirements.)



This can be contrasted with many other existing attempts to design
solutions which relied on keeping the basic internals of heat as-is
and just tweaking things - an approach we don't believe will work -
the issues arise from the current architecture, not the quality of the
code (which is fine).


Some of the ideas that have been proposed in the past:

- Moving execution of operations on individual resources to a 
distributed execution system using taskflow. (This should address the 
scalability issue.)
- Updating the stored template in real time during stack updates - this 
is happening in Juno btw. (This will solve the problem of inability to 
ever recover from an update failure. In theory, it would also make it 
possible to interrupt a running update and make changes.)
- Implementing a 'stack converge' operation that the user can trigger to 
compare the actual state of the stack with the model and bring it back 
into spec.


It would be interesting to see some analysis on exactly how these 
existing attempts fall down in trying to fulfil the goals, as well as 
the specific points at which the proposed implementation differs.


Depending on the answers to the above questions, this proposal could be 
anything between a modest reworking of those existing ideas and a 
complete re-imagining of the entire concept of Heat. I'd very much like 
to find out where along that spectrum it lies :)


BTW, it appears that the schedule you're suggesting involves assigning a 
bunch of people unfamiliar with the current code base and having them 
complete a ground-up rearchitecting of the whole engine, all within the 
Juno development cycle (about 3.5 months). This is simply not consistent 
with reality as I have observed it up to this point.


cheers,
Zane.



Re: [openstack-dev] [Heat][Summit] Input wanted - real world heat spec

2014-04-24 Thread Joshua Harlow
Chiming in,

I'd like taskflow to get into the state-machine area (it's been discussed for a 
while [1]). It also runs a workflow through defined steps [2] until completion 
(which, IMHO, is in a way a subset of a fully changeable state machine). It 
also tries to add more value since, when you declaratively define what the 'work' 
should be (within limits), you can then have taskflow execute it (via [3]), making 
your code less complicated IMHO (although it does add a new layer at the same 
time), and letting taskflow try to help make it reliable for you (helping you 
do things like resume from a crash, or run your code remotely via workers). Of 
course the library is actively developed (jump on #openstack-state-management), 
so additions like this, or something like the mentioned Machinist library 
(which seems to be just the foundational state-machine classes), could be hooked 
in/added.

The trouble I've had, and that will likely be had with an approach like this, is 
IMHO its complexity (and how much structuring code there will be, aka 
boilerplate); even currently taskflow imposes some boilerplate 
(task/flow/engine objects) and mindset changes on adopting projects. A 
state machine would impose similar + more (the states you could think of as task 
objects, the transitions would have to be some type of table, the reactions 
would be ??, and so on). Going down this path IMHO has to be done carefully and 
with consideration (and nothing will likely be perfect). This has always made 
me hesitate a little, in that it seems to add a lot of complexity that, if not 
done carefully, will cause more pain than goodness (the yin and yang). This is 
where I'd rather carefully figure out what this state machine looks like 
(Machinist looks to be the raw state-machine building blocks), how it 
will/could be used, and what benefit it will bring short-term and 
long-term. But I guess the one way to do it is try and learn (if you don't try you 
will never learn).

Certain other things become interesting questions for taskflow and any
type of state machine (concepts that taskflow has and uses that aren't
typically thought about):

1. Persistence [4] (see how it is used)
2. What does undoing/reverting/resuming a state-machine even mean?

[1] https://etherpad.openstack.org/p/CinderTaskFlowFSM
[2] http://docs.openstack.org/developer/taskflow/states.html
[3] http://docs.openstack.org/developer/taskflow/engines.html
[4] http://docs.openstack.org/developer/taskflow/persistence.html
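As a toy rendition of that boilerplate (deliberately *not* the real TaskFlow API, just the shape of it): work is declared as task objects with execute/revert pairs, grouped into a flow, and run by an engine, which is the hook that lets the library revert completed work on failure (and, in the real library, resume from a crash or run tasks on remote workers):

```python
# Toy task/flow/engine skeleton illustrating the structuring code an
# adopting project takes on; names mirror the concepts, not the real API.
class Task:
    def execute(self):
        raise NotImplementedError
    def revert(self):
        pass   # undo on failure; what "revert" means is per-task

class LinearFlow:
    def __init__(self, *tasks):
        self.tasks = list(tasks)

class Engine:
    def run(self, flow):
        started = []
        try:
            for t in flow.tasks:
                started.append(t)
                t.execute()
        except Exception:
            # Roll back every task that started, newest first.
            for t in reversed(started):
                t.revert()
            raise
```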

Anyways just some thoughts.

-Josh

From: Chris Armstrong chris.armstr...@rackspace.com
Reply-To: OpenStack Development Mailing List (not for usage questions) 
openstack-dev@lists.openstack.org
Date: Thursday, April 24, 2014 at 9:49 AM
To: OpenStack Development Mailing List (not for usage questions) 
openstack-dev@lists.openstack.org
Subject: Re: [openstack-dev] [Heat][Summit] Input wanted - real world heat spec

On April 23, 2014 at 7:47:37 PM, Robert Collins 
(robe...@robertcollins.net) wrote:
Hi, we've got this summit session planned -
http://summit.openstack.org/cfp/details/428 which is really about
https://etherpad.openstack.org/p/heat-workflow-vs-convergence

We'd love feedback and questions - this is a significant amount of
work, but work that I (and many others, based on responses so far)
believe is needed to really take Heat to users and ops teams.

Right now we're looking for both high and low level design and input.

One thing I’m curious about is whether we would gain benefit from explicitly 
managing resources as state machines. I’m not very familiar with TaskFlow, but 
my impression is that it basically knows how to run a defined workflow through 
multiple steps until completion. Heat resources will, with this change, become 
objects that need to react to inputs at any point in time, so I wonder if it’s 
better to model them as a finite state machine instead of just with workflows.

Granted, I’m pretty unfamiliar with TaskFlow, so I may be off the mark here. I 
would like to point out that a new very simple but concise FSM-modeling library 
was recently released called “Machinist”, and it may be worth taking a look at: 
https://github.com/hybridcluster/machinist

--
Christopher Armstrong
IRC: radix




Re: [openstack-dev] [Heat][Summit] Input wanted - real world heat spec

2014-04-24 Thread Clint Byrum
Excerpts from Zane Bitter's message of 2014-04-24 14:23:38 -0700:
 On 23/04/14 20:45, Robert Collins wrote:
  Hi, we've got this summit session planned -
  http://summit.openstack.org/cfp/details/428 which is really about
  https://etherpad.openstack.org/p/heat-workflow-vs-convergence
 
  We'd love feedback and questions - this is a significant amount of
  work, but work that I (and many others, based on responses so far)
  believe is needed to really take Heat to users and ops teams.
 
  Right now we're looking for both high and low level design and input.
 
  If you're an operator/user/developer of/with/around heat - please take
  a couple of minutes to look - feedback inline in the etherpad, or here
  on the list - whatever suits you.
 
  The basic idea is:
- no changes needed to the heat template language etc
 
 +1 for this part, definitely :)
 
- take a holistic view and fix the system's emergent properties by
  using a different baseline architecture within it
- ???
- profit!
 
 Thanks for writing this up Rob. This is certainly a more ambitious scale 
 of application to deploy than we ever envisioned in the early days of 
 Heat ;) But I firmly believe that what is good for TripleO will be great 
 for the rest of our users too. All of the observed issues mentioned are 
 things we definitely want to address.
 
 I have a few questions about the specific architecture being proposed. 
 It's not clear to me what you mean by "call-stack style" in referring to 
 the current paradigm. Maybe you could elaborate on how the current style 
 and the convergence style differ.
 
 Specifically, I am not clear on whether 'convergence' means:
   (a) Heat continues to respect the dependency graph but does not stop 
 after one traversal, instead repeatedly processing it until (and even 
 after) the stack is complete; or
   (b) Heat ignores the dependency graph and just throws everything 
 against the wall, repeating until it has all stuck.
 

I think (c). We still have the graph driving what to do next, so that
things are more likely to stick. Also, we don't want to do 10,000
instance creations if the database they need isn't going to become
available.

But we decouple "I need to do something" from "The user asked for
something" by allowing the convergence engine to act on notifications
from the observer engine. In addition to allowing more automated actions,
it should allow us to use finer grained locking because no individual
task will need to depend on the whole graph or stack. If an operator
comes along and changes templates or parameters, we can still complete
our outdated action. Eventually convergence will arrive at a state which
matches the desired stack.
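A rough sketch of that decoupling (all names hypothetical): the observer engine records what it sees and emits a notification; the convergence engine handles one notification at a time against whatever is *currently* desired, so a change to the template mid-flight simply alters what the next step converges toward:

```python
# Observer/convergence split: notifications carry "something changed",
# and each convergence step acts on a single resource against the
# currently desired state -- no step depends on the whole graph.
import collections

notifications = collections.deque()
desired = {}    # resource -> desired state (may be swapped at any time)
observed = {}   # resource -> last observed state

def observer_notify(resource, state):
    """Observer engine: record reality and queue a notification."""
    observed[resource] = state
    notifications.append(resource)

def convergence_step():
    """Handle one notification; lock only the single resource involved."""
    if not notifications:
        return None
    res = notifications.popleft()
    if observed.get(res) != desired.get(res):
        observed[res] = desired[res]   # stand-in for a real resource action
        return ("acted", res)
    return ("noop", res)
```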

 I also have doubts about the principle "Users should only need to 
 intervene with a stack when there is no right action that Heat can take 
 to deliver the current template+parameters". That sounds good in theory, 
 but in practice it's very hard to know when there is a right action Heat 
 can take and when there isn't. e.g. There are innumerable ways to create 
 a template that can _never_ actually converge, and I don't believe 
 there's a general way we can detect that, only the hard way: one error 
 type at a time, for every single resource type. Offering users a way to 
 control how and when that happens allows them to make the best decisions 
 for their particular circumstances - and hopefully a future WFaaS like 
 Mistral will make it easy to set up continuous monitoring for those who 
 require it. (Not incidentally, it also gives cloud operators an 
 opportunity to charge their users in proportion to their actual 
 requirements.)
 

There are some obvious times where there _is_ a clear automated answer
that does not require me to defer to a user's special workflow. 503 or
429 (I know, not ratified yet) status codes mean I should retry after
maybe backing off a bit. If I get an ERROR state on a nova VM, I should
retry a few times before giving up.
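The retry-after-backoff case Clint describes could be sketched like this (the status codes are from his example; the attempt counts and delays are illustrative assumptions):

```python
# Retry with exponential backoff on transient HTTP statuses; anything
# else is returned to the caller as a definitive answer.
import time

RETRYABLE_STATUS = {429, 503}   # "slow down" / "service unavailable"

def call_with_retry(request, max_attempts=5, base_delay=1.0):
    for attempt in range(max_attempts):
        status, body = request()
        if status not in RETRYABLE_STATUS:
            return status, body
        # Back off a bit more on each failed attempt before asking again.
        time.sleep(base_delay * (2 ** attempt))
    return status, body
```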

The point isn't that we have all the answers, it is that there are
plenty of places where we do have good answers that will serve
most users well.

  This can be contrasted with many other existing attempts to design
  solutions which relied on keeping the basic internals of heat as-is
  and just tweaking things - an approach we don't believe will work -
  the issues arise from the current architecture, not the quality of the
  code (which is fine).
 
 Some of the ideas that have been proposed in the past:
 
 - Moving execution of operations on individual resources to a 
 distributed execution system using taskflow. (This should address the 
 scalability issue.)

This is a superset of that. The same work that was going to be required
there will be required for this. We can't be loading the whole stack
just to do a single operation on a single resource.

 - Updating the stored template in real time during stack updates - this 
 is happening in Juno btw. (This will solve the problem of inability 

Re: [openstack-dev] [Heat][Summit] Input wanted - real world heat spec

2014-04-24 Thread Qiming Teng
  Specifically, I am not clear on whether 'convergence' means:
(a) Heat continues to respect the dependency graph but does not stop 
  after one traversal, instead repeatedly processing it until (and even 
  after) the stack is complete; or
(b) Heat ignores the dependency graph and just throws everything 
  against the wall, repeating until it has all stuck.
  
 
 I think (c). We still have the graph driving what to do next, so that
 things are more likely to stick. Also, we don't want to do 10,000
 instance creations if the database they need isn't going to become
 available.
 
 But we decouple "I need to do something" from "The user asked for
 something" by allowing the convergence engine to act on notifications
 from the observer engine. In addition to allowing more automated actions,
 it should allow us to use finer grained locking because no individual
 task will need to depend on the whole graph or stack. If an operator
 comes along and changes templates or parameters, we can still complete
 our outdated action. Eventually convergence will arrive at a state which
 matches the desired stack.

There could be livelocks or deadlocks if the granularity becomes finer. We
need some governing design to avoid them before we find it too difficult to
debug.

  I also have doubts about the principle "Users should only need to 
  intervene with a stack when there is no right action that Heat can take 
  to deliver the current template+parameters". That sounds good in theory, 
  but in practice it's very hard to know when there is a right action Heat 
  can take and when there isn't. e.g. There are innumerable ways to create 
  a template that can _never_ actually converge, and I don't believe 
  there's a general way we can detect that, only the hard way: one error 
  type at a time, for every single resource type. Offering users a way to 
  control how and when that happens allows them to make the best decisions 
  for their particular circumstances - and hopefully a future WFaaS like 
  Mistral will make it easy to set up continuous monitoring for those who 
  require it. (Not incidentally, it also gives cloud operators an 
  opportunity to charge their users in proportion to their actual 
  requirements.)
  
 
 There are some obvious times where there _is_ a clear automated answer
 that does not require me to defer to a user's special workflow. 503 or
 429 (I know, not ratified yet) status codes mean I should retry after
 maybe backing off a bit. If I get an ERROR state on a nova VM, I should
 retry a few times before giving up.

+1 on this.

 The point isn't that we have all the answers, it is that there are
 plenty of places where we do have good answers that will serve
 most users well.

Right. I would expect all resources in Heat to be wrapped (encapsulated)
well enough that they know how to handle most events.  Well, in some
cases, additional hints are expected/needed from the events.  If a
resource doesn't know how to respond to an event, we provide a default
(well-defined) propagation path for the message.  Assuming this can be
done, we only have to deal with some macro-level complexities where an
external workflow is needed.
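One purely illustrative way to get such a default propagation path is a chain-of-responsibility: a resource handles the events it knows about and forwards everything else to its parent (e.g. a scaling group or the stack), so only the unhandled macro-level cases ever reach an external workflow:

```python
# Hypothetical event routing: handlers map event names to reactions;
# unknown events propagate up the ownership chain by default.
class Resource:
    def __init__(self, name, handlers=None, parent=None):
        self.name, self.parent = name, parent
        self.handlers = handlers or {}

    def handle(self, event):
        if event in self.handlers:
            return self.handlers[event](self.name)
        if self.parent is not None:
            return self.parent.handle(event)   # default propagation path
        return "unhandled:%s" % event          # macro-level: needs a workflow
```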
 
 This obsoletes that. We don't need to keep track if we adopt a convergence
 model. The template that the user has asked for, is the template we
 converge on. The diff between that and reality dictates the changes we
 need to make. Wherever we're at with the convergence step that was last
 triggered can just be cancelled by the new one.

Seems that we need a protocol for cancelling an operation then ...




Re: [openstack-dev] [Heat][Summit] Input wanted - real world heat spec

2014-04-24 Thread Robert Collins
On 25 April 2014 14:31, Qiming Teng teng...@linux.vnet.ibm.com wrote:


 This obsoletes that. We don't need to keep track if we adopt a convergence
 model. The template that the user has asked for, is the template we
 converge on. The diff between that and reality dictates the changes we
 need to make. Wherever we're at with the convergence step that was last
 triggered can just be cancelled by the new one.

 Seems that we need a protocol for cancelling an operation then ...

I think Clint meant 'undoes' not 'cancels the in-progress code'.

-Rob


-- 
Robert Collins rbtcoll...@hp.com
Distinguished Technologist
HP Converged Cloud



Re: [openstack-dev] [Heat][Summit] Input wanted - real world heat spec

2014-04-24 Thread Robert Collins
On 25 April 2014 04:49, Chris Armstrong chris.armstr...@rackspace.com wrote:
 On April 23, 2014 at 7:47:37 PM, Robert Collins (robe...@robertcollins.net)
 wrote:

 Hi, we've got this summit session planned -
 http://summit.openstack.org/cfp/details/428 which is really about
 https://etherpad.openstack.org/p/heat-workflow-vs-convergence

 We'd love feedback and questions - this is a significant amount of
 work, but work that I (and many others, based on responses so far)
 believe is needed to really take Heat to users and ops teams.

 Right now we're looking for both high and low level design and input.

 One thing I’m curious about is whether we would gain benefit from explicitly
 managing resources as state machines. I’m not very familiar with TaskFlow,
 but my impression is that it basically knows how to run a defined workflow
 through multiple steps until completion. Heat resources will, with this
 change, become objects that need to react to inputs at any point in time, so
 I wonder if it’s better to model them as a finite state machine instead of
 just with workflows.

 Granted, I’m pretty unfamiliar with TaskFlow, so I may be off the mark here.
 I would like to point out that a new very simple but concise FSM-modeling
 library was recently released called “Machinist”, and it may be worth taking
 a look at: https://github.com/hybridcluster/machinist

Directly writing the mgmt code in an FSM structure would be pretty
cool I think. It is also perhaps orthogonal, but well worth some
closer examination. Can you perhaps sketch something up for folk to
eyeball?

As far as I see TaskFlow for the current proposal - we're basically
getting 'run a function' as an action, so it's a lot simpler in
concept.

-Rob

-- 
Robert Collins rbtcoll...@hp.com
Distinguished Technologist
HP Converged Cloud



Re: [openstack-dev] [Heat][Summit] Input wanted - real world heat spec

2014-04-24 Thread Robert Collins
On 25 April 2014 09:23, Zane Bitter zbit...@redhat.com wrote:

   - take a holistic view and fix the system's emergent properties by
 using a different baseline architecture within it
   - ???
   - profit!


 Thanks for writing this up Rob. This is certainly a more ambitious scale of
 application to deploy than we ever envisioned in the early days of Heat ;)
 But I firmly believe that what is good for TripleO will be great for the
 rest of our users too. All of the observed issues mentioned are things we
 definitely want to address.

 I have a few questions about the specific architecture being proposed. It's
 not clear to me what you mean by "call-stack style" in referring to the
 current paradigm. Maybe you could elaborate on how the current style and the
 convergence style differ.

So the call-stack style - we have an in-process data structure in the
heat engine which contains the traversal of the DAG. It's a bit awkward
to visualise because of the coroutine-style layer in there - but if
you squash that back it starts to look like a regular callstack:

frame resource
0 root
1 root-A
2 root-A-B
3 root-A-B-C

(representing that we're bringing up C, which is a dep of B, which is a dep
of A, which hangs off the root).

The concurrency allowed by coroutines means this really is a tree of
callstacks - but as a style it has all the same characteristics:
 - code is called top-down
 - the thing being executed is live data in memory, and thus largely
untouchable from outside
 - the entire structure has to run to completion, or fail - it acts as
a single large 'procedure call'.

The style I'm proposing we use is one where:
 - code is called in response to events
 - we exit after taking the 'next step' in response to an event, so we
can be very responsive to changes in intent without requiring every
routine to support early-exit of some form
 - we can stop executing at any arbitrary point, because we're running
small units at a time.

 Specifically, I am not clear on whether 'convergence' means:
  (a) Heat continues to respect the dependency graph but does not stop after
 one traversal, instead repeatedly processing it until (and even after) the
 stack is complete; or
  (b) Heat ignores the dependency graph and just throws everything against
 the wall, repeating until it has all stuck.

Clint used (c), so I'll use (d).

d) Heat stops evaluating the whole graph and instead only evaluates
one node at a time before exiting. Further events (such as timeouts,
resources changing state, or the user requesting a change) trigger
Heat to evaluate a node.
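A minimal sketch of (d) (structures hypothetical): each trigger evaluates exactly one node whose dependencies are complete and then exits, so no long-lived call stack holds the whole graph:

```python
# One unit of work per trigger: pick a single ready node, act on it, exit.
# Repeated triggers (timeouts, state changes, user requests) drive the
# stack to completion one node at a time.
def evaluate_one(graph, complete, run_node):
    """graph: {node: [deps]}. Run one node whose deps are all complete."""
    for node in graph:
        if node in complete:
            continue
        if all(dep in complete for dep in graph[node]):
            run_node(node)        # perform the single step for this node
            complete.add(node)
            return node           # exit after one unit of work
    return None                   # nothing ready (blocked or finished)
```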

 I also have doubts about the principle "Users should only need to intervene
 with a stack when there is no right action that Heat can take to deliver the
 current template+parameters". That sounds good in theory, but in practice
 it's very hard to know when there is a right action Heat can take and when
 there isn't. e.g. There are innumerable ways to create a template that can
 _never_ actually converge, and I don't believe there's a general way we can
 detect that, only the hard way: one error type at a time, for every single
 resource type. Offering users a way to control how and when that happens

I agree with the innumerable ways - that's a hard truth. For instance,
if nova is sick, instances may never come up, and trying forever to
spawn something that can't spawn is pointless.

However, Nova instance spawn success rates in many clouds (e.g.
rackspace and HP) are much less than 100% - treating a failed instance
spawn as an error is totally unrealistic. I contend that it's Heat's
job to 'do what needs to be done' to get that nova instance, and if it
decides it cannot, then and only then to signal an error higher up (which
for e.g. a scaling group might be to not error *at all* but just to
try another one).

Hmm, let's try this another way:
 - 'failed-but-retryable' at a local scope is well defined but hard to
code for (because, as you say, we have to add types to catch one at a
time, per resource type).
 - 'failed' at a local scope is well defined - any exception we don't catch :)

BUT 'failed' at a higher level is not well defined: what does 'failed'
mean for a scaling group? I don't think it's reasonable that a single
non-retryable API error in one of the nested stacks should invalidate
a scaling group as a whole. Now, let's go back to considering the local
scope of a single resource - if we ask Nova for an instance, and it
goes BUILDING-SPAWNING-ERROR, is that 'retryable'? I actually think
that 'retry' here on a per-error-code basis makes sense: the question
is 'did the resource become usable?' No - try harder until
timeout. Yes? - look holistically (e.g. DELETION_POLICY, is it in a
scaling group) to decide if it's recoverable.

So generally speaking we can detect 'failed to converge in X hours' -
and if you examine existing prior art that works in production with
Nova - things like 'nodepool' - that's exactly what they do (the
timeout in nodepool is