Re: [openstack-dev] [Heat][Summit] Input wanted - real world heat spec
On 25/04/14 11:29, Clint Byrum wrote:
> Also by loading the whole stack we've allowed resources to bleed into
> other resources. Currently to read Metadata for a single item that
> entails _a lot_ of queries to the database because we end up having to
> load the entire stack. We can't continue that as stacks grow in size.

As an aside, this changeset[1] results in a stack load requiring _one_ query instead of _a lot_. Clint's argument still stands though.

[1] https://review.openstack.org/#/q/status:open+project:openstack/heat+branch:master+topic:bug/1306743,n,z

___
OpenStack-dev mailing list
OpenStack-dev@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
Re: [openstack-dev] [Heat][Summit] Input wanted - real world heat spec
On April 23, 2014 at 7:47:37 PM, Robert Collins (robe...@robertcollins.net) wrote:
> Hi, we've got this summit session planned -
> http://summit.openstack.org/cfp/details/428 which is really about
> https://etherpad.openstack.org/p/heat-workflow-vs-convergence
>
> We'd love feedback and questions - this is a significant amount of
> work, but work I (and many others based on responses so far) believe
> is needed to really take Heat to users and ops teams. Right now we're
> looking for both high and low level design and input.

One thing I'm curious about is whether we would gain benefit from explicitly managing resources as state machines. I'm not very familiar with TaskFlow, but my impression is that it basically knows how to run a defined workflow through multiple steps until completion. Heat resources will, with this change, become objects that need to react to inputs at any point in time, so I wonder if it's better to model them as a finite state machine instead of just with workflows.

Granted, I'm pretty unfamiliar with TaskFlow, so I may be off the mark here. I would like to point out that a very simple but concise new FSM-modeling library called "Machinist" was recently released, and it may be worth taking a look at: https://github.com/hybridcluster/machinist

--
Christopher Armstrong
IRC: radix
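For readers unfamiliar with the idea, a finite state machine for a resource lifecycle can be as small as a transition table plus an event handler. Below is a minimal stdlib sketch of the concept; the state and event names are hypothetical illustrations, and this is not Machinist's actual API:

```python
# Minimal FSM sketch of a resource lifecycle. State/event names are made
# up for illustration; this is NOT Machinist's API nor Heat's real states.
TRANSITIONS = {
    ("INIT", "create"): "CREATE_IN_PROGRESS",
    ("CREATE_IN_PROGRESS", "create_ok"): "CREATE_COMPLETE",
    ("CREATE_IN_PROGRESS", "create_err"): "CREATE_FAILED",
    ("CREATE_FAILED", "create"): "CREATE_IN_PROGRESS",  # retry
    ("CREATE_COMPLETE", "update"): "UPDATE_IN_PROGRESS",
    ("UPDATE_IN_PROGRESS", "update_ok"): "UPDATE_COMPLETE",
}

class Resource:
    """An object that reacts to inputs at any point in time."""

    def __init__(self):
        self.state = "INIT"

    def handle(self, event):
        # Look up the transition for (current state, event); anything not
        # in the table is rejected instead of silently corrupting state.
        try:
            self.state = TRANSITIONS[(self.state, event)]
        except KeyError:
            raise ValueError(
                "invalid event %r in state %s" % (event, self.state))
        return self.state

r = Resource()
r.handle("create")
r.handle("create_err")
r.handle("create")            # retry after a failure
print(r.handle("create_ok"))  # -> CREATE_COMPLETE
```

The appeal over a one-shot workflow is that any event can arrive in any state, and the table makes the legal reactions explicit.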
Re: [openstack-dev] [Heat][Summit] Input wanted - real world heat spec
On 23/04/14 20:45, Robert Collins wrote:
> Hi, we've got this summit session planned -
> http://summit.openstack.org/cfp/details/428 which is really about
> https://etherpad.openstack.org/p/heat-workflow-vs-convergence
>
> We'd love feedback and questions - this is a significant amount of
> work, but work I (and many others based on responses so far) believe
> is needed to really take Heat to users and ops teams. Right now we're
> looking for both high and low level design and input. If you're an
> operator/user/developer of/with/around heat - please take a couple of
> minutes to look - feedback inline in the etherpad, or here on the
> list - whatever suits you.
>
> The basic idea is:
> - no changes needed to the heat template language etc

+1 for this part, definitely :)

> - take a holistic view and fix the system's emergent properties by
>   using a different baseline architecture within it
> - ???
> - profit!

Thanks for writing this up Rob. This is certainly a more ambitious scale of application to deploy than we ever envisioned in the early days of Heat ;) But I firmly believe that what is good for TripleO will be great for the rest of our users too. All of the observed issues mentioned are things we definitely want to address.

I have a few questions about the specific architecture being proposed. It's not clear to me what you mean by "call-stack style" in referring to the current paradigm. Maybe you could elaborate on how the current style and the convergence style differ. Specifically, I am not clear on whether 'convergence' means:

(a) Heat continues to respect the dependency graph but does not stop after one traversal, instead repeatedly processing it until (and even after) the stack is complete; or
(b) Heat ignores the dependency graph and just throws everything against the wall, repeating until it has all stuck.

I also have doubts about the principle "Users should only need to intervene with a stack when there is no right action that Heat can take to deliver the current template+parameters."
That sounds good in theory, but in practice it's very hard to know when there is a right action Heat can take and when there isn't. e.g. There are innumerable ways to create a template that can _never_ actually converge, and I don't believe there's a general way we can detect that, only the hard way: one error type at a time, for every single resource type. Offering users a way to control how and when that happens allows them to make the best decisions for their particular circumstances - and hopefully a future WFaaS like Mistral will make it easy to set up continuous monitoring for those who require it. (Not incidentally, it also gives cloud operators an opportunity to charge their users in proportion to their actual requirements.)

> This can be contrasted with many other existing attempts to design
> solutions which relied on keeping the basic internals of heat as-is
> and just tweaking things - an approach we don't believe will work -
> the issues arise from the current architecture, not the quality of
> the code (which is fine).

Some of the ideas that have been proposed in the past:

- Moving execution of operations on individual resources to a distributed execution system using taskflow. (This should address the scalability issue.)
- Updating the stored template in real time during stack updates - this is happening in Juno btw. (This will solve the problem of inability to ever recover from an update failure. In theory, it would also make it possible to interrupt a running update and make changes.)
- Implementing a 'stack converge' operation that the user can trigger to compare the actual state of the stack with the model and bring it back into spec.

It would be interesting to see some analysis on exactly how these existing attempts fall down in trying to fulfil the goals, as well as the specific points at which the proposed implementation differs.
Depending on the answers to the above questions, this proposal could be anything between a modest reworking of those existing ideas and a complete re-imagining of the entire concept of Heat. I'd very much like to find out where along that spectrum it lies :)

BTW, it appears that the schedule you're suggesting involves assigning a bunch of people unfamiliar with the current code base to complete a ground-up rearchitecting of the whole engine, all within the Juno development cycle (about 3.5 months). This is simply not consistent with reality as I have observed it up to this point.

cheers,
Zane.
Re: [openstack-dev] [Heat][Summit] Input wanted - real world heat spec
Chiming in: I'd like taskflow to get into the state-machine area (it's been discussed for a while [1]). It also runs a workflow through defined steps [2] until completion (which IMHO is in a way a subset of the fully changeable state-machine). It also tries to add more value, since when you declaratively define what the 'work' should be (within limits) you can then have taskflow execute it (via [3]), making your code less complicated IMHO (although it does add a new layer at the same time), and letting taskflow try to help make it reliable for you (helping you do things like resume from a crash, or run your code remotely via workers). Of course the library is actively developed (jump on #openstack-state-management), so additions like this, or something like the mentioned Machinist library (which seems to be just the foundational state-machine classes), could be hooked in/added.

The trouble I've had, and that will likely be had with an approach like this, is IMHO its complexity (and how much structuring code, aka boilerplate, there will be); even now taskflow imposes some boilerplate (task/flow/engine objects) and mindset changes on adopting projects. A state-machine would impose similar + more (the states you could think of as task objects, the transitions would have to be some type of table, the reactions would be ??, and so on). Going down this path IMHO has to be done carefully and with consideration (and nothing will likely be perfect). This has always made me hesitate a little, in that it seems to add a lot of complexity that, if not done carefully, will cause more pain than goodness (the yin and yang). This is where I'd rather carefully figure out what this state-machine looks like (Machinist looks to be the raw state-machine building blocks), how it will/could be used, and what benefit it will bring short-term and long-term. But I guess the one way to do it is to try and learn (if you don't try, you will never learn).
Certain other things become interesting questions for taskflow and any type of state-machine (concepts that taskflow has that are being used but aren't typically thought about):

1. Persistence [4] (see how it is used)
2. What does undoing/reverting/resuming a state-machine even mean?

[1] https://etherpad.openstack.org/p/CinderTaskFlowFSM
[2] http://docs.openstack.org/developer/taskflow/states.html
[3] http://docs.openstack.org/developer/taskflow/engines.html
[4] http://docs.openstack.org/developer/taskflow/persistence.html

Anyways just some thoughts.

-Josh

From: Chris Armstrong <chris.armstr...@rackspace.com>
Reply-To: "OpenStack Development Mailing List (not for usage questions)" <openstack-dev@lists.openstack.org>
Date: Thursday, April 24, 2014 at 9:49 AM
To: "OpenStack Development Mailing List (not for usage questions)" <openstack-dev@lists.openstack.org>
Subject: Re: [openstack-dev] [Heat][Summit] Input wanted - real world heat spec

> On April 23, 2014 at 7:47:37 PM, Robert Collins
> (robe...@robertcollins.net) wrote:
> > Hi, we've got this summit session planned -
> > http://summit.openstack.org/cfp/details/428 which is really about
> > https://etherpad.openstack.org/p/heat-workflow-vs-convergence
> >
> > We'd love feedback and questions - this is a significant amount of
> > work, but work I (and many others based on responses so far) believe
> > is needed to really take Heat to users and ops teams. Right now
> > we're looking for both high and low level design and input.
>
> One thing I'm curious about is whether we would gain benefit from
> explicitly managing resources as state machines. I'm not very familiar
> with TaskFlow, but my impression is that it basically knows how to run
> a defined workflow through multiple steps until completion.
> Heat resources will, with this change, become objects that need to
> react to inputs at any point in time, so I wonder if it's better to
> model them as a finite state machine instead of just with workflows.
> Granted, I'm pretty unfamiliar with TaskFlow, so I may be off the mark
> here.
>
> I would like to point out that a very simple but concise new
> FSM-modeling library called "Machinist" was recently released, and it
> may be worth taking a look at:
> https://github.com/hybridcluster/machinist
>
> --
> Christopher Armstrong
> IRC: radix
Re: [openstack-dev] [Heat][Summit] Input wanted - real world heat spec
Excerpts from Zane Bitter's message of 2014-04-24 14:23:38 -0700:
> On 23/04/14 20:45, Robert Collins wrote:
> > Hi, we've got this summit session planned -
> > http://summit.openstack.org/cfp/details/428 which is really about
> > https://etherpad.openstack.org/p/heat-workflow-vs-convergence
> >
> > We'd love feedback and questions - this is a significant amount of
> > work, but work I (and many others based on responses so far) believe
> > is needed to really take Heat to users and ops teams. Right now
> > we're looking for both high and low level design and input. If
> > you're an operator/user/developer of/with/around heat - please take
> > a couple of minutes to look - feedback inline in the etherpad, or
> > here on the list - whatever suits you.
> >
> > The basic idea is:
> > - no changes needed to the heat template language etc
>
> +1 for this part, definitely :)
>
> > - take a holistic view and fix the system's emergent properties by
> >   using a different baseline architecture within it
> > - ???
> > - profit!
>
> Thanks for writing this up Rob. This is certainly a more ambitious
> scale of application to deploy than we ever envisioned in the early
> days of Heat ;) But I firmly believe that what is good for TripleO
> will be great for the rest of our users too. All of the observed
> issues mentioned are things we definitely want to address.
>
> I have a few questions about the specific architecture being proposed.
> It's not clear to me what you mean by "call-stack style" in referring
> to the current paradigm. Maybe you could elaborate on how the current
> style and the convergence style differ. Specifically, I am not clear
> on whether 'convergence' means:
>
> (a) Heat continues to respect the dependency graph but does not stop
>     after one traversal, instead repeatedly processing it until (and
>     even after) the stack is complete; or
> (b) Heat ignores the dependency graph and just throws everything
>     against the wall, repeating until it has all stuck.

I think (c). We still have the graph driving what to do next, so that the things are more likely to stick.
Also we don't want to do 10,000 instance creations if the database they need isn't going to become available. But we decouple "I need to do something" from "The user asked for something" by allowing the convergence engine to act on notifications from the observer engine. In addition to allowing more automated actions, it should allow us to use finer-grained locking because no individual task will need to depend on the whole graph or stack. If an operator comes along and changes templates or parameters, we can still complete our outdated action. Eventually convergence will arrive at a state which matches the desired stack.

> I also have doubts about the principle "Users should only need to
> intervene with a stack when there is no right action that Heat can
> take to deliver the current template+parameters." That sounds good in
> theory, but in practice it's very hard to know when there is a right
> action Heat can take and when there isn't. e.g. There are innumerable
> ways to create a template that can _never_ actually converge, and I
> don't believe there's a general way we can detect that, only the hard
> way: one error type at a time, for every single resource type.
> Offering users a way to control how and when that happens allows them
> to make the best decisions for their particular circumstances - and
> hopefully a future WFaaS like Mistral will make it easy to set up
> continuous monitoring for those who require it. (Not incidentally, it
> also gives cloud operators an opportunity to charge their users in
> proportion to their actual requirements.)

There are some obvious times where there _is_ a clear automated answer that does not require me to defer to a user's special workflow. 503 or 429 (I know, not ratified yet) status codes mean I should retry after maybe backing off a bit. If I get an ERROR state on a nova VM, I should retry a few times before giving up.
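The kind of policy Clint describes above could be sketched like this. It is only an illustration with made-up function names and thresholds, not actual Heat code:

```python
import random

# HTTP statuses that signal "try again later": 429 Too Many Requests,
# 503 Service Unavailable. (429 was still a draft at the time.)
RETRYABLE_HTTP = {429, 503}

def next_action(status_code=None, resource_state=None, attempts=0,
                max_retries=3, base_delay=1.0):
    """Decide what a convergence engine might do after a failure.

    Returns ("retry", delay_seconds) or ("give_up", None). The function
    name, signature and limits are hypothetical, chosen for illustration.
    """
    if status_code in RETRYABLE_HTTP:
        # Throttled/unavailable: back off exponentially with jitter.
        delay = base_delay * (2 ** attempts) * (1 + random.random())
        return ("retry", delay)
    if resource_state == "ERROR" and attempts < max_retries:
        # e.g. a nova VM in ERROR: retry a few times before giving up.
        return ("retry", base_delay)
    return ("give_up", None)

print(next_action(status_code=503, attempts=1))
print(next_action(resource_state="ERROR", attempts=5))  # -> ('give_up', None)
```

Only when the retry budget is exhausted would the failure be escalated to the user, per the principle being debated.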
The point isn't that we have all the answers, it is that there are plenty of places where we do have good answers that will serve most users well.

> > This can be contrasted with many other existing attempts to design
> > solutions which relied on keeping the basic internals of heat as-is
> > and just tweaking things - an approach we don't believe will work -
> > the issues arise from the current architecture, not the quality of
> > the code (which is fine).
>
> Some of the ideas that have been proposed in the past:
>
> - Moving execution of operations on individual resources to a
>   distributed execution system using taskflow. (This should address
>   the scalability issue.)

This is a superset of that. The same work that was going to be required there will be required for this. We can't be loading the whole stack just to do a single operation on a single resource.

> - Updating the stored template in real time during stack updates -
>   this is happening in Juno btw. (This will solve the problem of
>   inability
Re: [openstack-dev] [Heat][Summit] Input wanted - real world heat spec
> > Specifically, I am not clear on whether 'convergence' means:
> > (a) Heat continues to respect the dependency graph but does not stop
> >     after one traversal, instead repeatedly processing it until (and
> >     even after) the stack is complete; or
> > (b) Heat ignores the dependency graph and just throws everything
> >     against the wall, repeating until it has all stuck.
>
> I think (c). We still have the graph driving what to do next so that
> the things are more likely to stick. Also we don't want to do 10,000
> instance creations if the database they need isn't going to become
> available. But we decouple "I need to do something" from "The user
> asked for something" by allowing the convergence engine to act on
> notifications from the observer engine. In addition to allowing more
> automated actions, it should allow us to use finer-grained locking
> because no individual task will need to depend on the whole graph or
> stack. If an operator comes along and changes templates or parameters,
> we can still complete our outdated action. Eventually convergence will
> arrive at a state which matches the desired stack.

There could be livelocks or deadlocks if the granularity becomes smaller. We need some guiding design to avoid them before we find it too difficult to debug.

> > I also have doubts about the principle "Users should only need to
> > intervene with a stack when there is no right action that Heat can
> > take to deliver the current template+parameters." That sounds good
> > in theory, but in practice it's very hard to know when there is a
> > right action Heat can take and when there isn't. e.g. There are
> > innumerable ways to create a template that can _never_ actually
> > converge, and I don't believe there's a general way we can detect
> > that, only the hard way: one error type at a time, for every single
> > resource type.
> > Offering users a way to control how and when that happens allows
> > them to make the best decisions for their particular circumstances -
> > and hopefully a future WFaaS like Mistral will make it easy to set
> > up continuous monitoring for those who require it. (Not
> > incidentally, it also gives cloud operators an opportunity to charge
> > their users in proportion to their actual requirements.)
>
> There are some obvious times where there _is_ a clear automated answer
> that does not require me to defer to a user's special workflow. 503 or
> 429 (I know, not ratified yet) status codes mean I should retry after
> maybe backing off a bit. If I get an ERROR state on a nova VM, I
> should retry a few times before giving up.

+1 on this.

> The point isn't that we have all the answers, it is that there are
> plenty of places where we do have good answers that will serve most
> users well.

Right. I would expect all resources in Heat to be wrapped (encapsulated) well enough that they know how to handle most events. Well, in some cases, additional hints are expected/needed from the events. If a resource doesn't know how to respond to an event, we provide a default (well-defined) propagation path for the message. Assuming this can be done, we only have to deal with some macro-level complexities where an external workflow is needed.

> This obsoletes that. We don't need to keep track if we adopt a
> convergence model. The template that the user has asked for is the
> template we converge on. The diff between that and reality dictates
> the changes we need to make. Wherever we're at with the convergence
> step that was last triggered can just be cancelled by the new one.

Seems that we need a protocol for cancelling an operation then ...
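The "diff between the template and reality dictates the changes" idea discussed here can be illustrated with a toy sketch. The data shapes and function name below are hypothetical and deliberately simplified, nothing like Heat's internal representation:

```python
def converge_diff(desired, observed):
    """Toy diff between desired template resources and observed reality.

    Both arguments map a resource name to a properties dict. Returns the
    per-resource action a convergence engine would schedule. Hypothetical
    structure, chosen only to make the idea concrete.
    """
    actions = {}
    for name, props in desired.items():
        if name not in observed:
            actions[name] = "create"          # in the template, not in reality
        elif observed[name] != props:
            actions[name] = "update"          # reality drifted from the template
    for name in observed:
        if name not in desired:
            actions[name] = "delete"          # in reality, not in the template
    return actions

desired = {"db": {"flavor": "m1.large"}, "web": {"image": "fedora"}}
observed = {"db": {"flavor": "m1.small"}, "old_lb": {}}
print(converge_diff(desired, observed))
# -> {'db': 'update', 'web': 'create', 'old_lb': 'delete'}
```

If the user submits a new template mid-flight, the engine simply recomputes this diff against the new desired state; the outdated step's result is folded into "reality" and corrected on the next pass.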
Re: [openstack-dev] [Heat][Summit] Input wanted - real world heat spec
On 25 April 2014 14:31, Qiming Teng <teng...@linux.vnet.ibm.com> wrote:
> > ...s needed. This obsoletes that. We don't need to keep track if we
> > adopt a convergence model. The template that the user has asked for
> > is the template we converge on. The diff between that and reality
> > dictates the changes we need to make. Wherever we're at with the
> > convergence step that was last triggered can just be cancelled by
> > the new one.
>
> Seems that we need a protocol for cancelling an operation then ...

I think Clint meant 'undoes', not 'cancels the in-progress code'.

-Rob

--
Robert Collins <rbtcoll...@hp.com>
Distinguished Technologist
HP Converged Cloud
Re: [openstack-dev] [Heat][Summit] Input wanted - real world heat spec
On 25 April 2014 04:49, Chris Armstrong <chris.armstr...@rackspace.com> wrote:
> On April 23, 2014 at 7:47:37 PM, Robert Collins
> (robe...@robertcollins.net) wrote:
> > Hi, we've got this summit session planned -
> > http://summit.openstack.org/cfp/details/428 which is really about
> > https://etherpad.openstack.org/p/heat-workflow-vs-convergence
> >
> > We'd love feedback and questions - this is a significant amount of
> > work, but work I (and many others based on responses so far) believe
> > is needed to really take Heat to users and ops teams. Right now
> > we're looking for both high and low level design and input.
>
> One thing I'm curious about is whether we would gain benefit from
> explicitly managing resources as state machines. I'm not very familiar
> with TaskFlow, but my impression is that it basically knows how to run
> a defined workflow through multiple steps until completion. Heat
> resources will, with this change, become objects that need to react to
> inputs at any point in time, so I wonder if it's better to model them
> as a finite state machine instead of just with workflows.
>
> Granted, I'm pretty unfamiliar with TaskFlow, so I may be off the mark
> here. I would like to point out that a very simple but concise new
> FSM-modeling library called "Machinist" was recently released, and it
> may be worth taking a look at:
> https://github.com/hybridcluster/machinist

Directly writing the mgmt code in an FSM structure would be pretty cool, I think. It is also perhaps orthogonal, but well worth some closer examination. Can you perhaps sketch something up for folk to eyeball?

As far as I can see, TaskFlow for the current proposal basically gives us 'run a function' as an action, so it's a lot simpler in concept.

-Rob

--
Robert Collins <rbtcoll...@hp.com>
Distinguished Technologist
HP Converged Cloud
Re: [openstack-dev] [Heat][Summit] Input wanted - real world heat spec
On 25 April 2014 09:23, Zane Bitter <zbit...@redhat.com> wrote:
> > - take a holistic view and fix the system's emergent properties by
> >   using a different baseline architecture within it
> > - ???
> > - profit!
>
> Thanks for writing this up Rob. This is certainly a more ambitious
> scale of application to deploy than we ever envisioned in the early
> days of Heat ;) But I firmly believe that what is good for TripleO
> will be great for the rest of our users too. All of the observed
> issues mentioned are things we definitely want to address.
>
> I have a few questions about the specific architecture being proposed.
> It's not clear to me what you mean by "call-stack style" in referring
> to the current paradigm. Maybe you could elaborate on how the current
> style and the convergence style differ.

So, the call-stack style: we have an in-process data structure in the heat engine which contains the traversal of the DAG. It's a bit awkward to visualise because of the coroutine-style layer in there - but if you squash that back it starts to look like a regular call stack:

  frame  resource
  0      root
  1      root-A
  2      root-A-B
  3      root-A-B-C

(representing that we're bringing up C, which is a dep of B, which is a dep of A, which hangs off the root). The concurrency allowed by coroutines means this really is a tree of call stacks - but as a style it has all the same characteristics:

- code is called top-down
- the thing being executed is live data in memory, and thus largely untouchable from outside
- the entire structure has to run to completion, or fail - it acts as a single large 'procedure call'

The style I'm proposing we use is one where:

- code is called in response to events
- we exit after taking the 'next step' in response to an event, so we can be very responsive to changes in intent without requiring every routine to support early-exit of some form
- we can stop executing at any arbitrary point, because we're running small units at a time
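The contrast between the two styles can be made concrete with a toy sketch: instead of one long-lived call walking the whole DAG to completion, each incoming event evaluates at most one ready node and then returns. This is an illustration only, with made-up names, not the proposed implementation:

```python
from collections import deque

def step(graph, done, event_queue):
    """Evaluate at most one ready node per event, then return.

    graph maps node -> set of dependency nodes; a node is 'ready' when
    all its dependencies are done. Toy sketch of the event-driven style,
    not the proposed engine; names are hypothetical.
    """
    if not event_queue:
        return None
    event_queue.popleft()  # consume one event (timeout, state change, ...)
    for node, deps in graph.items():
        if node not in done and deps <= done:
            done.add(node)  # in reality: take the next convergence action
            return node     # ...then exit, leaving everything else idle
    return None

# C depends on nothing, B on C, root on B (mirroring root-A-B-C above,
# simplified to three nodes).
graph = {"root": {"A"}, "A": {"B"}, "B": set()}
done, events = set(), deque(["e1", "e2", "e3"])
order = [step(graph, done, events) for _ in range(3)]
print(order)  # -> ['B', 'A', 'root']
```

Because all state lives in `graph` and `done` rather than in a live call stack, the desired state can be swapped out between any two steps, which is exactly the responsiveness to changed intent described above.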
> Specifically, I am not clear on whether 'convergence' means:
> (a) Heat continues to respect the dependency graph but does not stop
>     after one traversal, instead repeatedly processing it until (and
>     even after) the stack is complete; or
> (b) Heat ignores the dependency graph and just throws everything
>     against the wall, repeating until it has all stuck.

Clint used (c), so I'll use (d).

(d) Heat stops evaluating the whole graph and instead only evaluates one node at a time before exiting. Further events (such as timeouts, resources changing state, or the user requesting a change) trigger Heat to evaluate a node.

> I also have doubts about the principle "Users should only need to
> intervene with a stack when there is no right action that Heat can
> take to deliver the current template+parameters." That sounds good in
> theory, but in practice it's very hard to know when there is a right
> action Heat can take and when there isn't. e.g. There are innumerable
> ways to create a template that can _never_ actually converge, and I
> don't believe there's a general way we can detect that, only the hard
> way: one error type at a time, for every single resource type.
> Offering users a way to control how and when that happens

I agree with the innumerable ways - that's a hard truth. For instance, if nova is sick, instances may never come up, and trying forever to spawn something that can't spawn is pointless. However, Nova instance spawn success rates in many clouds (e.g. Rackspace and HP) are much less than 100% - treating a failed instance spawn as an error is totally unrealistic. I contend that it's Heat's job to 'do what needs to be done' to get that nova instance, and if it decides it cannot, then and only then to signal an error higher up (which for e.g. a scaling group might be to not error *at all* but just to try another one).
Hmm, let's try this another way:

- 'failed-but-retryable' at a local scope is well defined but hard to code for (because, as you say, we have to add types to catch one at a time, per resource type).
- 'failed' at a local scope is well defined - any exception we don't catch :)

BUT 'failed' at a higher level is not well defined: what does 'failed' mean for a scaling group? I don't think it's reasonable that a single non-retryable API error in one of the nested stacks should invalidate a scaling group as a whole.

Now, let's go back to considering the local scope of a single resource - if we ask Nova for an instance, and it goes BUILDING-SPAWNING-ERROR, is that 'retryable'? I actually don't think that 'retry' here on a per-error-code basis makes sense: what makes sense is 'did the resource become usable?' No - try harder until timeout. Yes? - look holistically (e.g. DELETION_POLICY, is it in a scaling group) to decide if it's recoverable.

So generally speaking we can detect 'failed to converge in X hours' - and if you examine existing prior art that works in production with Nova - things like 'nodepool' - that's exactly what they do (the timeout in nodepool is
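The "did the resource become usable? No - try harder until timeout" criterion described above is straightforward to express concretely. This is a hedged sketch with made-up names, mirroring the nodepool-style timeout rather than any actual Heat or nodepool code:

```python
import time

def wait_for_convergence(check_usable, timeout, poll_interval=1.0,
                         clock=time.monotonic, sleep=time.sleep):
    """Keep trying until the resource is usable or the timeout elapses.

    check_usable() answers 'did the resource become usable?'. Only after
    the deadline passes do we signal failure upward, where a holistic
    decision (scaling group, DELETION_POLICY, ...) can be made. All names
    here are hypothetical illustrations.
    """
    deadline = clock() + timeout
    while clock() < deadline:
        if check_usable():
            return True
        sleep(poll_interval)  # in a real engine: yield and await an event
    return False

# Simulate a resource that becomes usable on the third check.
checks = iter([False, False, True])
ok = wait_for_convergence(lambda: next(checks), timeout=10.0,
                          poll_interval=0.01)
print(ok)  # -> True
```

A production engine would be event-driven rather than polling in a loop, but the success criterion - usable within the window, or escalate - is the same.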