Re: [openstack-dev] [Mistral][TaskFlow] Long running actions
On Apr 1, 2014, at 3:43 AM, Renat Akhmerov rakhme...@mirantis.com wrote: On 25 Mar 2014, at 01:51, Joshua Harlow harlo...@yahoo-inc.com wrote: The first execution model I would call the local execution model, this model involves forming tasks and flows and then executing them inside an application, that application is running for the duration of the workflow (although if it crashes it can re-establish the task and flows that it was doing and attempt to resume them). This could also be what openstack projects would call the 'conductor' approach where nova, ironic, trove have a conductor which manages these long-running actions (the conductor is alive/running throughout the duration of these workflows, although it may be restarted while running). The restarting + resuming part is something that openstack hasn't handled so gracefully currently, typically requiring either some type of cleanup at restart (or by operations), with taskflow using this model the resumption part makes it possible to resume from the last saved state (this connects into the persistence model that taskflow uses, the state transitions, how execution occurrs itself...). The second execution model is an extension of the first, whereby there is still a type of 'conductor' that is managing the life-time of the workflow, but instead of locally executing tasks in the conductor itself tasks are now executed on remote-workers (see http://tinyurl.com/lf3yqe4 ). The engine currently still is 'alive' for the life-time of the execution, although the work that it is doing is relatively minimal (since its not actually executing any task code, but proxying those requests to others works). The engine while running does the conducting of the remote-workers (saving persistence details, doing state-transtions, getting results, sending requests to workers…). These two execution models are special cases of what you call “lazy execution model” (or passive as we call it). To illustrate this idea we can take a look at the first sequence diagram at [0], we basically will see the following interaction: 1) engine --(task)-- queue --(task)-- worker 2) execute task 3) worker --(result)-- queue --(result)-- engine This is how TaskFlow worker based model works. If we loosen the requirement in 3) and assume that not only worker can send a task result back to engine we’ll get our passive model. Instead of worker it can be anything else (some external system) that knows how to make this call. A particular way is not too important, it can be a direct message or it can be hidden behind an API method. In Mistral it’s now a REST API method however we’re about to decouple engine from REST API so that engine is a standalone process and listens to a queue. So worker-based model is basically the same with the only strict requirement that only worker sends a result back. In order to implement local execution model on top of “lazy execution model” we just need to abstract a transport (queue) so that we can use an in-process transport. That’s it. It’s what Mistral already has implemented. Again, we see that “lazy execution model” is more universal. IMO this “lazy execution model” should be the main execution model that TaskFlow supports, others can be easily implemented on top of it. But the opposite assertion is wrong. IMO this is the most important obstacle in all our discussions, the reason why we don’t always understand each other well enough. I know it may be a lot of work to shift a paradigm in TaskFlow team but if we did that we would get enough freedom for using TaskFlow in lots of cases. Let me know what you think. I might have missed something. DZ: Interesting idea! So that other models of execution are based on lazy execution model? TaskFlow implements this, we can use it, and for other clients more convenient higher level execution models are provided? Interesting. Makes sense. @Joshua? @Kirill? Others? === HA === So this is an interesting question, and to me is strongly connected to how your engines are executing (and the persistence and state-transitions that they go through while running). Without persistence of state and transitions there is no good way (a bad way of course can be created, by just redoing all the work, but that's not always feasible or the best option) to accomplish resuming in a sane manner and there is also imho no way to accomplish any type of automated HA of workflows. Sure, no questions here. Let me describe: When you save the states of a workflow and any intermediate results of a workflow to some database (for example) and the engine (see above models) which is being used (for example the conductor type from above) the application containing that engine may be prone to crashes (or just being powered off due to software upgrades...). Since taskflows key primitives were made to allow for
Re: [openstack-dev] [Mistral][TaskFlow] Long running actions
Inline responses. From: Renat Akhmerov rakhme...@mirantis.commailto:rakhme...@mirantis.com Reply-To: OpenStack Development Mailing List (not for usage questions) openstack-dev@lists.openstack.orgmailto:openstack-dev@lists.openstack.org Date: Tuesday, April 1, 2014 at 3:43 AM To: OpenStack Development Mailing List (not for usage questions) openstack-dev@lists.openstack.orgmailto:openstack-dev@lists.openstack.org Subject: Re: [openstack-dev] [Mistral][TaskFlow] Long running actions On 25 Mar 2014, at 01:51, Joshua Harlow harlo...@yahoo-inc.commailto:harlo...@yahoo-inc.com wrote: The first execution model I would call the local execution model, this model involves forming tasks and flows and then executing them inside an application, that application is running for the duration of the workflow (although if it crashes it can re-establish the task and flows that it was doing and attempt to resume them). This could also be what openstack projects would call the 'conductor' approach where nova, ironic, trove have a conductor which manages these long-running actions (the conductor is alive/running throughout the duration of these workflows, although it may be restarted while running). The restarting + resuming part is something that openstack hasn't handled so gracefully currently, typically requiring either some type of cleanup at restart (or by operations), with taskflow using this model the resumption part makes it possible to resume from the last saved state (this connects into the persistence model that taskflow uses, the state transitions, how execution occurrs itself...). The second execution model is an extension of the first, whereby there is still a type of 'conductor' that is managing the life-time of the workflow, but instead of locally executing tasks in the conductor itself tasks are now executed on remote-workers (see http://tinyurl.com/lf3yqe4 ). The engine currently still is 'alive' for the life-time of the execution, although the work that it is doing is relatively minimal (since its not actually executing any task code, but proxying those requests to others works). The engine while running does the conducting of the remote-workers (saving persistence details, doing state-transtions, getting results, sending requests to workers…). These two execution models are special cases of what you call “lazy execution model” (or passive as we call it). To illustrate this idea we can take a look at the first sequence diagram at [0], we basically will see the following interaction: 1) engine --(task)-- queue --(task)-- worker 2) execute task 3) worker --(result)-- queue --(result)-- engine This is how TaskFlow worker based model works. If we loosen the requirement in 3) and assume that not only worker can send a task result back to engine we’ll get our passive model. Instead of worker it can be anything else (some external system) that knows how to make this call. A particular way is not too important, it can be a direct message or it can be hidden behind an API method. In Mistral it’s now a REST API method however we’re about to decouple engine from REST API so that engine is a standalone process and listens to a queue. So worker-based model is basically the same with the only strict requirement that only worker sends a result back. In order to implement local execution model on top of “lazy execution model” we just need to abstract a transport (queue) so that we can use an in-process transport. That’s it. It’s what Mistral already has implemented. Again, we see that “lazy execution model” is more universal. IMO this “lazy execution model” should be the main execution model that TaskFlow supports, others can be easily implemented on top of it. But the opposite assertion is wrong. IMO this is the most important obstacle in all our discussions, the reason why we don’t always understand each other well enough. I know it may be a lot of work to shift a paradigm in TaskFlow team but if we did that we would get enough freedom for using TaskFlow in lots of cases. Everything is a lot of work ;) That’s just how it goes. I think we work through it then it's all fine in the end. I think some of this is being resolved/discussed @ http://tinyurl.com/k3s2gmy Let me know what you think. I might have missed something. === HA === So this is an interesting question, and to me is strongly connected to how your engines are executing (and the persistence and state-transitions that they go through while running). Without persistence of state and transitions there is no good way (a bad way of course can be created, by just redoing all the work, but that's not always feasible or the best option) to accomplish resuming in a sane manner and there is also imho no way to accomplish any type of automated HA of workflows. Sure, no questions here. Let me describe: When you save the states of a workflow and any intermediate results of a workflow to some database
Re: [openstack-dev] [Mistral][TaskFlow] Long running actions
More inline. From: Dmitri Zimine d...@stackstorm.commailto:d...@stackstorm.com Reply-To: OpenStack Development Mailing List (not for usage questions) openstack-dev@lists.openstack.orgmailto:openstack-dev@lists.openstack.org Date: Tuesday, April 1, 2014 at 2:59 PM To: OpenStack Development Mailing List (not for usage questions) openstack-dev@lists.openstack.orgmailto:openstack-dev@lists.openstack.org Subject: Re: [openstack-dev] [Mistral][TaskFlow] Long running actions On Apr 1, 2014, at 3:43 AM, Renat Akhmerov rakhme...@mirantis.commailto:rakhme...@mirantis.com wrote: On 25 Mar 2014, at 01:51, Joshua Harlow harlo...@yahoo-inc.commailto:harlo...@yahoo-inc.com wrote: The first execution model I would call the local execution model, this model involves forming tasks and flows and then executing them inside an application, that application is running for the duration of the workflow (although if it crashes it can re-establish the task and flows that it was doing and attempt to resume them). This could also be what openstack projects would call the 'conductor' approach where nova, ironic, trove have a conductor which manages these long-running actions (the conductor is alive/running throughout the duration of these workflows, although it may be restarted while running). The restarting + resuming part is something that openstack hasn't handled so gracefully currently, typically requiring either some type of cleanup at restart (or by operations), with taskflow using this model the resumption part makes it possible to resume from the last saved state (this connects into the persistence model that taskflow uses, the state transitions, how execution occurrs itself...). The second execution model is an extension of the first, whereby there is still a type of 'conductor' that is managing the life-time of the workflow, but instead of locally executing tasks in the conductor itself tasks are now executed on remote-workers (see http://tinyurl.com/lf3yqe4 ). The engine currently still is 'alive' for the life-time of the execution, although the work that it is doing is relatively minimal (since its not actually executing any task code, but proxying those requests to others works). The engine while running does the conducting of the remote-workers (saving persistence details, doing state-transtions, getting results, sending requests to workers…). These two execution models are special cases of what you call “lazy execution model” (or passive as we call it). To illustrate this idea we can take a look at the first sequence diagram at [0], we basically will see the following interaction: 1) engine --(task)-- queue --(task)-- worker 2) execute task 3) worker --(result)-- queue --(result)-- engine This is how TaskFlow worker based model works. If we loosen the requirement in 3) and assume that not only worker can send a task result back to engine we’ll get our passive model. Instead of worker it can be anything else (some external system) that knows how to make this call. A particular way is not too important, it can be a direct message or it can be hidden behind an API method. In Mistral it’s now a REST API method however we’re about to decouple engine from REST API so that engine is a standalone process and listens to a queue. So worker-based model is basically the same with the only strict requirement that only worker sends a result back. In order to implement local execution model on top of “lazy execution model” we just need to abstract a transport (queue) so that we can use an in-process transport. That’s it. It’s what Mistral already has implemented. Again, we see that “lazy execution model” is more universal. IMO this “lazy execution model” should be the main execution model that TaskFlow supports, others can be easily implemented on top of it. But the opposite assertion is wrong. IMO this is the most important obstacle in all our discussions, the reason why we don’t always understand each other well enough. I know it may be a lot of work to shift a paradigm in TaskFlow team but if we did that we would get enough freedom for using TaskFlow in lots of cases. Let me know what you think. I might have missed something. DZ: Interesting idea! So that other models of execution are based on lazy execution model? TaskFlow implements this, we can use it, and for other clients more convenient higher level execution models are provided? Interesting. Makes sense. @Joshua? @Kirill? Others? I think this is likely possible, which is simiar to whats in http://tinyurl.com/k3s2gmy, engine types can be built from each other (and if we wanted to alter the structure that exists in taskflow) then sure. But see that message for more of my concerns around exposing that engine API to library users (I think it could have its usage in mistral to expose this, but I'm not sure its useful for elsewhere, and once its public engine API, its public for a very long time). === HA === So
Re: [openstack-dev] [Mistral][TaskFlow] Long running actions
On 02 Apr 2014, at 06:00, Joshua Harlow harlo...@yahoo-inc.com wrote: More inline. From: Dmitri Zimine d...@stackstorm.com Reply-To: OpenStack Development Mailing List (not for usage questions) openstack-dev@lists.openstack.org Date: Tuesday, April 1, 2014 at 2:59 PM To: OpenStack Development Mailing List (not for usage questions) openstack-dev@lists.openstack.org Subject: Re: [openstack-dev] [Mistral][TaskFlow] Long running actions On Apr 1, 2014, at 3:43 AM, Renat Akhmerov rakhme...@mirantis.com wrote: On 25 Mar 2014, at 01:51, Joshua Harlow harlo...@yahoo-inc.com wrote: The first execution model I would call the local execution model, this model involves forming tasks and flows and then executing them inside an application, that application is running for the duration of the workflow (although if it crashes it can re-establish the task and flows that it was doing and attempt to resume them). This could also be what openstack projects would call the 'conductor' approach where nova, ironic, trove have a conductor which manages these long-running actions (the conductor is alive/running throughout the duration of these workflows, although it may be restarted while running). The restarting + resuming part is something that openstack hasn't handled so gracefully currently, typically requiring either some type of cleanup at restart (or by operations), with taskflow using this model the resumption part makes it possible to resume from the last saved state (this connects into the persistence model that taskflow uses, the state transitions, how execution occurrs itself...). The second execution model is an extension of the first, whereby there is still a type of 'conductor' that is managing the life-time of the workflow, but instead of locally executing tasks in the conductor itself tasks are now executed on remote-workers (see http://tinyurl.com/lf3yqe4 ). The engine currently still is 'alive' for the life-time of the execution, although the work that it is doing is relatively minimal (since its not actually executing any task code, but proxying those requests to others works). The engine while running does the conducting of the remote-workers (saving persistence details, doing state-transtions, getting results, sending requests to workers…). These two execution models are special cases of what you call “lazy execution model” (or passive as we call it). To illustrate this idea we can take a look at the first sequence diagram at [0], we basically will see the following interaction: 1) engine --(task)-- queue --(task)-- worker 2) execute task 3) worker --(result)-- queue --(result)-- engine This is how TaskFlow worker based model works. If we loosen the requirement in 3) and assume that not only worker can send a task result back to engine we’ll get our passive model. Instead of worker it can be anything else (some external system) that knows how to make this call. A particular way is not too important, it can be a direct message or it can be hidden behind an API method. In Mistral it’s now a REST API method however we’re about to decouple engine from REST API so that engine is a standalone process and listens to a queue. So worker-based model is basically the same with the only strict requirement that only worker sends a result back. In order to implement local execution model on top of “lazy execution model” we just need to abstract a transport (queue) so that we can use an in-process transport. That’s it. It’s what Mistral already has implemented. Again, we see that “lazy execution model” is more universal. IMO this “lazy execution model” should be the main execution model that TaskFlow supports, others can be easily implemented on top of it. But the opposite assertion is wrong. IMO this is the most important obstacle in all our discussions, the reason why we don’t always understand each other well enough. I know it may be a lot of work to shift a paradigm in TaskFlow team but if we did that we would get enough freedom for using TaskFlow in lots of cases. Let me know what you think. I might have missed something. DZ: Interesting idea! So that other models of execution are based on lazy execution model? TaskFlow implements this, we can use it, and for other clients more convenient higher level execution models are provided? Interesting. Makes sense. @Joshua? @Kirill? Others? I think this is likely possible, which is simiar to whats in http://tinyurl.com/k3s2gmy, engine types can be built from each other (and if we wanted to alter the structure that exists in taskflow) then sure. But see that message for more of my concerns around exposing that engine API to library users (I think it could have its usage in mistral to expose this, but I'm not sure its useful for elsewhere, and once its public engine API, its public for a very long time). What are we
Re: [openstack-dev] [Mistral][TaskFlow] Long running actions
Get er' done! Haha, u guys want to jump into #openstack-state-management tommorow or on our Thursday meeting we can discuss more how this might work and such. That'd be cool. Sent from my really tiny device... On Apr 1, 2014, at 9:37 PM, Renat Akhmerov rakhme...@mirantis.commailto:rakhme...@mirantis.com wrote: On 02 Apr 2014, at 06:00, Joshua Harlow harlo...@yahoo-inc.commailto:harlo...@yahoo-inc.com wrote: More inline. From: Dmitri Zimine d...@stackstorm.commailto:d...@stackstorm.com Reply-To: OpenStack Development Mailing List (not for usage questions) openstack-dev@lists.openstack.orgmailto:openstack-dev@lists.openstack.org Date: Tuesday, April 1, 2014 at 2:59 PM To: OpenStack Development Mailing List (not for usage questions) openstack-dev@lists.openstack.orgmailto:openstack-dev@lists.openstack.org Subject: Re: [openstack-dev] [Mistral][TaskFlow] Long running actions On Apr 1, 2014, at 3:43 AM, Renat Akhmerov rakhme...@mirantis.commailto:rakhme...@mirantis.com wrote: On 25 Mar 2014, at 01:51, Joshua Harlow harlo...@yahoo-inc.commailto:harlo...@yahoo-inc.com wrote: The first execution model I would call the local execution model, this model involves forming tasks and flows and then executing them inside an application, that application is running for the duration of the workflow (although if it crashes it can re-establish the task and flows that it was doing and attempt to resume them). This could also be what openstack projects would call the 'conductor' approach where nova, ironic, trove have a conductor which manages these long-running actions (the conductor is alive/running throughout the duration of these workflows, although it may be restarted while running). The restarting + resuming part is something that openstack hasn't handled so gracefully currently, typically requiring either some type of cleanup at restart (or by operations), with taskflow using this model the resumption part makes it possible to resume from the last saved state (this connects into the persistence model that taskflow uses, the state transitions, how execution occurrs itself...). The second execution model is an extension of the first, whereby there is still a type of 'conductor' that is managing the life-time of the workflow, but instead of locally executing tasks in the conductor itself tasks are now executed on remote-workers (see http://tinyurl.com/lf3yqe4 ). The engine currently still is 'alive' for the life-time of the execution, although the work that it is doing is relatively minimal (since its not actually executing any task code, but proxying those requests to others works). The engine while running does the conducting of the remote-workers (saving persistence details, doing state-transtions, getting results, sending requests to workers…). These two execution models are special cases of what you call “lazy execution model” (or passive as we call it). To illustrate this idea we can take a look at the first sequence diagram at [0], we basically will see the following interaction: 1) engine --(task)-- queue --(task)-- worker 2) execute task 3) worker --(result)-- queue --(result)-- engine This is how TaskFlow worker based model works. If we loosen the requirement in 3) and assume that not only worker can send a task result back to engine we’ll get our passive model. Instead of worker it can be anything else (some external system) that knows how to make this call. A particular way is not too important, it can be a direct message or it can be hidden behind an API method. In Mistral it’s now a REST API method however we’re about to decouple engine from REST API so that engine is a standalone process and listens to a queue. So worker-based model is basically the same with the only strict requirement that only worker sends a result back. In order to implement local execution model on top of “lazy execution model” we just need to abstract a transport (queue) so that we can use an in-process transport. That’s it. It’s what Mistral already has implemented. Again, we see that “lazy execution model” is more universal. IMO this “lazy execution model” should be the main execution model that TaskFlow supports, others can be easily implemented on top of it. But the opposite assertion is wrong. IMO this is the most important obstacle in all our discussions, the reason why we don’t always understand each other well enough. I know it may be a lot of work to shift a paradigm in TaskFlow team but if we did that we would get enough freedom for using TaskFlow in lots of cases. Let me know what you think. I might have missed something. DZ: Interesting idea! So that other models of execution are based on lazy execution model? TaskFlow implements this, we can use it, and for other clients more convenient higher level execution models are provided? Interesting. Makes sense. @Joshua? @Kirill? Others? I think this is likely possible, which is simiar to whats in http
Re: [openstack-dev] [Mistral][TaskFlow] Long running actions
connected to how your engines are executing (and the persistence and state-transitions that they go through while running). Without persistence of state and transitions there is no good way (a bad way of course can be created, by just redoing all the work, but that's not always feasible or the best option) to accomplish resuming in a sane manner and there is also imho no way to accomplish any type of automated HA of workflows. Since taskflow was concieved to manage the states and transitions of tasks and flows it gains the ability to do this resuming but it also gains the ability to automatically provide execution HA to its users. Let me describe: When you save the states of a workflow and any intermediate results of a workflow to some database (for example) and the engine (see above models) which is being used (for example the conductor type from above) the application containing that engine may be prone to crashes (or just being powered off due to software upgrades...). Since taskflows key primitives were made to allow for resuming when a crash occurs, it is relatively simple to allow another application (also running a conductor) to resume whatever that prior application was doing when it crashed. Now most users of taskflow don't want to have to do this resumption manually (although they can if they want) so it would be expected that the other versions of that application would be running would automatically 'know' how to 'take-over' the work of the failed application. This is where the concept of the taskflows 'jobboard' (http://tinyurl.com/klg358j) comes into play, where a jobboard can be backed by something like zookeeper (which provides notifications of lock lose/release to others automatically). The jobboard is the place where the other applications would be looking to 'take-over' the other failed applications work (by using zookeeper 'atomic' primitives designed for this type of usage) and they would also be releasing the work back for others to 'take-over' when there own zookeeper connection is lost (zookeeper handles this this natively). -- Now as for how much of mistral would change from the above, I don't know, but thats why it's a POC. -Josh From: Joshua Harlow harlo...@yahoo-inc.com Date: Friday, March 21, 2014 at 1:14 PM To: OpenStack Development Mailing List (not for usage questions) openstack-dev@lists.openstack.org Cc: OpenStack Development Mailing List (not for usage questions) openstack-dev@lists.openstack.org Subject: Re: [openstack-dev] [Mistral][TaskFlow] Long running actions Will advise soon, out sick with not so fun case of poison oak, will reply next week (hopefully) when I'm less incapacitated... Sent from my really tiny device... On Mar 21, 2014, at 3:24 AM, Renat Akhmerov rakhme...@mirantis.com wrote: Valid concerns. It would be great to get Joshua involved in this discussion. If it’s possible to do in TaskFlow he could advise on how exactly. Renat Akhmerov @ Mirantis Inc. On 21 Mar 2014, at 16:23, Stan Lagun sla...@mirantis.com wrote: Don't forget HA issues. Mistral can be restarted at any moment and need to be able to proceed from the place it was interrupted on another instance. In theory it can be addressed by TaskFlow but I'm not sure it can be done without complete redesign of it On Fri, Mar 21, 2014 at 8:33 AM, W Chan m4d.co...@gmail.com wrote: Can the long running task be handled by putting the target task in the workflow in a persisted state until either an event triggers it or timeout occurs? An event (human approval or trigger from an external system) sent to the transport will rejuvenate the task. The timeout is configurable by the end user up to a certain time limit set by the mistral admin. Based on the TaskFlow examples, it seems like the engine instance managing the workflow will be in memory until the flow is completed. Unless there's other options to schedule tasks in TaskFlow, if we have too many of these workflows with long running tasks, seems like it'll become a memory issue for mistral... On Thu, Mar 20, 2014 at 3:07 PM, Dmitri Zimine d...@stackstorm.com wrote: For the 'asynchronous manner' discussion see http://tinyurl.com/n3v9lt8; I'm still not sure why u would want to make is_sync/is_async a primitive concept in a workflow system, shouldn't this be only up to the entity running the workflow to decide? Why is a task allowed to be sync/async, that has major side-effects for state-persistence, resumption (and to me is a incorrect abstraction to provide) and general workflow execution control, I'd be very careful with this (which is why I am hesitant to add it without much much more discussion). Let's remove the confusion caused by async. All tasks [may] run async from the engine standpoint, agreed. Long running tasks - that's it. Examples: wait_5_days, run_hadoop_job
Re: [openstack-dev] [Mistral][TaskFlow] Long running actions
Cool, sounds great. I think all 3 models can co-exist (since each serves a good purpose), it'd be intersting to see how the POC 'engine' can become a taskflow 'engine' (aka the lazy_engine). As to scalability I agree lazy_engine would be nicer, but how much more scalable is a tough one to quantify (the openstack systems that have active conductors, aka model #2, seem to scale pretty well). Of course there are some interesting questions laziness brings up; it'd be interesting to see how the POC addressed them. Some questions I can think of (currently), maybe u can address them in the other thread (which is fine to). What does the watchdog do? Is it activated periodically to 'reap' jobs that have timed out (or have gone past some time limit)? How does the watchdog know that it is reaping jobs that are not actively being worked on (a timeout likely isn't sufficient for jobs that just take a very long time)? Is there a connection into zookepeer (or some similar system) to do this kind of 'liveness' verification instead? What does the watchdog do when reaping tasks? (revert them, retry them, other..?) I'm not quite sure how a taskflow would use mistral as a client for this watchdog, since the watchdog process is pretty key to the lazy_engines execution model and it seems like it would be a bad idea to split that logic from the actual execution model itself (seeing that the watchdog is involved in the execution process, and really isn't external to it). To me the concept of the lazy_engine is similar to the case where an engine 'crashes' while running, in a way the lazy_engine 'crashes on purpose' after asking a set of workers to do some action (and hands over the resumption of 'itself' to this watchdog process). The watchdog then watches over the workers, and on response from some worker the watchdog resumes the engine and then lets the engine 'crash on purpose' again (and repeat). So the watchdog - lazy_engine execution model seems to be pretty interconnected. -Josh From: Dmitri Zimine d...@stackstorm.commailto:d...@stackstorm.com Reply-To: OpenStack Development Mailing List (not for usage questions) openstack-dev@lists.openstack.orgmailto:openstack-dev@lists.openstack.org Date: Wednesday, March 26, 2014 at 2:12 PM To: OpenStack Development Mailing List (not for usage questions) openstack-dev@lists.openstack.orgmailto:openstack-dev@lists.openstack.org Subject: Re: [openstack-dev] [Mistral][TaskFlow] Long running actions === Long-running delegate [1] actions == Yes, the third model of lazy / passive engine is needed. Obviously workflows contain a mix of different tasks, so this 3rd model should handle both normal tasks (run on a workers and return) and long running delegates. The active mechanism which is alive during the process, currently in done by TaskFlow engine, may be moved from the TaskFlow library to a client (Mistral) which implements the watchdog. This may require a lower-level API to TaskFlow. The benefit of the model 2 is 'ease of use' for some clients (create tasks, define flow, instantiate engine, engine.run(), that's it!). But I agree that the model 2 - worker-based TaskFlow engine - won't scale to WFaaS requirements even though the engine is not doing much. Mistral POC implements a passive, lazy workflow model: a service moving the states of multiple parallel executions. I'll detail the how Mistral handles long running tasks in a separate thread (may be here http://tinyurl.com/n3v9lt8) and we can look at how TaskFlow may change to fit. DZ PS. Thanks for clarifications on the target use cases for the execution models! [1] Calling them 'delegate actions' to distinguish between long running computations on workers, and actions that delegate to 3rd party systems (hadoop job, human input gateway, etc). On Mar 24, 2014, at 11:51 AM, Joshua Harlow harlo...@yahoo-inc.commailto:harlo...@yahoo-inc.com wrote: So getting back to this thread. I'd like to split it up into a few sections to address the HA and long-running-actions cases, which I believe are 2 seperate (but connected) questions. === Long-running actions === First, let me describe a little bit about what I believe are the execution models that taskflow currently targets (but is not limited to just targeting in general). The first execution model I would call the local execution model, this model involves forming tasks and flows and then executing them inside an application, that application is running for the duration of the workflow (although if it crashes it can re-establish the task and flows that it was doing and attempt to resume them). This could also be what openstack projects would call the 'conductor' approach where nova, ironic, trove have a conductor which manages these long-running actions (the conductor is alive/running throughout the duration of these workflows, although it may be restarted while running). The restarting + resuming part is something
Re: [openstack-dev] [Mistral][TaskFlow] Long running actions
On 03/21/2014 12:33 AM, W Chan wrote: Can the long running task be handled by putting the target task in the workflow in a persisted state until either an event triggers it or timeout occurs? An event (human approval or trigger from an external system) sent to the transport will rejuvenate the task. The timeout is configurable by the end user up to a certain time limit set by the mistral admin. Based on the TaskFlow examples, it seems like the engine instance managing the workflow will be in memory until the flow is completed. Unless there's other options to schedule tasks in TaskFlow, if we have too many of these workflows with long running tasks, seems like it'll become a memory issue for mistral... Look into the Trusts capability of Keystone for Authorization support on long running tasks. On Thu, Mar 20, 2014 at 3:07 PM, Dmitri Zimine d...@stackstorm.com mailto:d...@stackstorm.com wrote: For the 'asynchronous manner' discussion see http://tinyurl.com/n3v9lt8; I'm still not sure why u would want to make is_sync/is_async a primitive concept in a workflow system, shouldn't this be only up to the entity running the workflow to decide? Why is a task allowed to be sync/async, that has major side-effects for state-persistence, resumption (and to me is a incorrect abstraction to provide) and general workflow execution control, I'd be very careful with this (which is why I am hesitant to add it without much much more discussion). Let's remove the confusion caused by async. All tasks [may] run async from the engine standpoint, agreed. Long running tasks - that's it. Examples: wait_5_days, run_hadoop_job, take_human_input. The Task doesn't do the job: it delegates to an external system. The flow execution needs to wait (5 days passed, hadoob job finished with data x, user inputs y), and than continue with the received results. The requirement is to survive a restart of any WF component without loosing the state of the long running operation. Does TaskFlow already have a way to do it? Or ongoing ideas, considerations? If yes let's review. Else let's brainstorm together. I agree, that has major side-effects for state-persistence, resumption (and to me is a incorrect abstraction to provide) and general workflow execution control, I'd be very careful with this But these requirement comes from customers' use cases: wait_5_day - lifecycle management workflow, long running external system - Murano requirements, user input - workflow for operation automations with control gate checks, provisions which require 'approval' steps, etc. DZ ___ OpenStack-dev mailing list OpenStack-dev@lists.openstack.org mailto:OpenStack-dev@lists.openstack.org http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev ___ OpenStack-dev mailing list OpenStack-dev@lists.openstack.org http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev ___ OpenStack-dev mailing list OpenStack-dev@lists.openstack.org http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
Re: [openstack-dev] [Mistral][TaskFlow] Long running actions
of the failed application. This is where the concept of the taskflows 'jobboard' (http://tinyurl.com/klg358j) comes into play, where a jobboard can be backed by something like zookeeper (which provides notifications of lock lose/release to others automatically). The jobboard is the place where the other applications would be looking to 'take-over' the other failed applications work (by using zookeeper 'atomic' primitives designed for this type of usage) and they would also be releasing the work back for others to 'take-over' when there own zookeeper connection is lost (zookeeper handles this this natively). -- Now as for how much of mistral would change from the above, I don't know, but thats why it's a POC. -Josh From: Joshua Harlow harlo...@yahoo-inc.commailto:harlo...@yahoo-inc.com Date: Friday, March 21, 2014 at 1:14 PM To: OpenStack Development Mailing List (not for usage questions) openstack-dev@lists.openstack.orgmailto:openstack-dev@lists.openstack.org Cc: OpenStack Development Mailing List (not for usage questions) openstack-dev@lists.openstack.orgmailto:openstack-dev@lists.openstack.org Subject: Re: [openstack-dev] [Mistral][TaskFlow] Long running actions Will advise soon, out sick with not so fun case of poison oak, will reply next week (hopefully) when I'm less incapacitated... Sent from my really tiny device... On Mar 21, 2014, at 3:24 AM, Renat Akhmerov rakhme...@mirantis.commailto:rakhme...@mirantis.com wrote: Valid concerns. It would be great to get Joshua involved in this discussion. If it’s possible to do in TaskFlow he could advise on how exactly. Renat Akhmerov @ Mirantis Inc. On 21 Mar 2014, at 16:23, Stan Lagun sla...@mirantis.commailto:sla...@mirantis.com wrote: Don't forget HA issues. Mistral can be restarted at any moment and need to be able to proceed from the place it was interrupted on another instance. In theory it can be addressed by TaskFlow but I'm not sure it can be done without complete redesign of it On Fri, Mar 21, 2014 at 8:33 AM, W Chan m4d.co...@gmail.commailto:m4d.co...@gmail.com wrote: Can the long running task be handled by putting the target task in the workflow in a persisted state until either an event triggers it or timeout occurs? An event (human approval or trigger from an external system) sent to the transport will rejuvenate the task. The timeout is configurable by the end user up to a certain time limit set by the mistral admin. Based on the TaskFlow examples, it seems like the engine instance managing the workflow will be in memory until the flow is completed. Unless there's other options to schedule tasks in TaskFlow, if we have too many of these workflows with long running tasks, seems like it'll become a memory issue for mistral... On Thu, Mar 20, 2014 at 3:07 PM, Dmitri Zimine d...@stackstorm.commailto:d...@stackstorm.com wrote: For the 'asynchronous manner' discussion see http://tinyurl.com/n3v9lt8; I'm still not sure why u would want to make is_sync/is_async a primitive concept in a workflow system, shouldn't this be only up to the entity running the workflow to decide? Why is a task allowed to be sync/async, that has major side-effects for state-persistence, resumption (and to me is a incorrect abstraction to provide) and general workflow execution control, I'd be very careful with this (which is why I am hesitant to add it without much much more discussion). Let's remove the confusion caused by async. All tasks [may] run async from the engine standpoint, agreed. Long running tasks - that's it. Examples: wait_5_days, run_hadoop_job, take_human_input. The Task doesn't do the job: it delegates to an external system. The flow execution needs to wait (5 days passed, hadoob job finished with data x, user inputs y), and than continue with the received results. The requirement is to survive a restart of any WF component without loosing the state of the long running operation. Does TaskFlow already have a way to do it? Or ongoing ideas, considerations? If yes let's review. Else let's brainstorm together. I agree, that has major side-effects for state-persistence, resumption (and to me is a incorrect abstraction to provide) and general workflow execution control, I'd be very careful with this But these requirement comes from customers' use cases: wait_5_day - lifecycle management workflow, long running external system - Murano requirements, user input - workflow for operation automations with control gate checks, provisions which require 'approval' steps, etc. DZ ___ OpenStack-dev mailing list OpenStack-dev@lists.openstack.orgmailto:OpenStack-dev@lists.openstack.org http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev ___ OpenStack-dev mailing list OpenStack-dev@lists.openstack.orgmailto:OpenStack-dev@lists.openstack.org http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
Re: [openstack-dev] [Mistral][TaskFlow] Long running actions
Don't forget HA issues. Mistral can be restarted at any moment and need to be able to proceed from the place it was interrupted on another instance. In theory it can be addressed by TaskFlow but I'm not sure it can be done without complete redesign of it On Fri, Mar 21, 2014 at 8:33 AM, W Chan m4d.co...@gmail.com wrote: Can the long running task be handled by putting the target task in the workflow in a persisted state until either an event triggers it or timeout occurs? An event (human approval or trigger from an external system) sent to the transport will rejuvenate the task. The timeout is configurable by the end user up to a certain time limit set by the mistral admin. Based on the TaskFlow examples, it seems like the engine instance managing the workflow will be in memory until the flow is completed. Unless there's other options to schedule tasks in TaskFlow, if we have too many of these workflows with long running tasks, seems like it'll become a memory issue for mistral... On Thu, Mar 20, 2014 at 3:07 PM, Dmitri Zimine d...@stackstorm.com wrote: For the 'asynchronous manner' discussion see http://tinyurl.com/n3v9lt8; I'm still not sure why u would want to make is_sync/is_async a primitive concept in a workflow system, shouldn't this be only up to the entity running the workflow to decide? Why is a task allowed to be sync/async, that has major side-effects for state-persistence, resumption (and to me is a incorrect abstraction to provide) and general workflow execution control, I'd be very careful with this (which is why I am hesitant to add it without much much more discussion). Let's remove the confusion caused by async. All tasks [may] run async from the engine standpoint, agreed. Long running tasks - that's it. Examples: wait_5_days, run_hadoop_job, take_human_input. The Task doesn't do the job: it delegates to an external system. The flow execution needs to wait (5 days passed, hadoob job finished with data x, user inputs y), and than continue with the received results. The requirement is to survive a restart of any WF component without loosing the state of the long running operation. Does TaskFlow already have a way to do it? Or ongoing ideas, considerations? If yes let's review. Else let's brainstorm together. I agree, that has major side-effects for state-persistence, resumption (and to me is a incorrect abstraction to provide) and general workflow execution control, I'd be very careful with this But these requirement comes from customers' use cases: wait_5_day - lifecycle management workflow, long running external system - Murano requirements, user input - workflow for operation automations with control gate checks, provisions which require 'approval' steps, etc. DZ ___ OpenStack-dev mailing list OpenStack-dev@lists.openstack.org http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev ___ OpenStack-dev mailing list OpenStack-dev@lists.openstack.org http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev -- Sincerely yours Stanislav (Stan) Lagun Senior Developer Mirantis 35b/3, Vorontsovskaya St. Moscow, Russia Skype: stanlagun www.mirantis.com sla...@mirantis.com ___ OpenStack-dev mailing list OpenStack-dev@lists.openstack.org http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
Re: [openstack-dev] [Mistral][TaskFlow] Long running actions
Valid concerns. It would be great to get Joshua involved in this discussion. If it’s possible to do in TaskFlow he could advise on how exactly. Renat Akhmerov @ Mirantis Inc. On 21 Mar 2014, at 16:23, Stan Lagun sla...@mirantis.com wrote: Don't forget HA issues. Mistral can be restarted at any moment and need to be able to proceed from the place it was interrupted on another instance. In theory it can be addressed by TaskFlow but I'm not sure it can be done without complete redesign of it On Fri, Mar 21, 2014 at 8:33 AM, W Chan m4d.co...@gmail.com wrote: Can the long running task be handled by putting the target task in the workflow in a persisted state until either an event triggers it or timeout occurs? An event (human approval or trigger from an external system) sent to the transport will rejuvenate the task. The timeout is configurable by the end user up to a certain time limit set by the mistral admin. Based on the TaskFlow examples, it seems like the engine instance managing the workflow will be in memory until the flow is completed. Unless there's other options to schedule tasks in TaskFlow, if we have too many of these workflows with long running tasks, seems like it'll become a memory issue for mistral... On Thu, Mar 20, 2014 at 3:07 PM, Dmitri Zimine d...@stackstorm.com wrote: For the 'asynchronous manner' discussion see http://tinyurl.com/n3v9lt8; I'm still not sure why u would want to make is_sync/is_async a primitive concept in a workflow system, shouldn't this be only up to the entity running the workflow to decide? Why is a task allowed to be sync/async, that has major side-effects for state-persistence, resumption (and to me is a incorrect abstraction to provide) and general workflow execution control, I'd be very careful with this (which is why I am hesitant to add it without much much more discussion). Let's remove the confusion caused by async. All tasks [may] run async from the engine standpoint, agreed. Long running tasks - that's it. Examples: wait_5_days, run_hadoop_job, take_human_input. The Task doesn't do the job: it delegates to an external system. The flow execution needs to wait (5 days passed, hadoob job finished with data x, user inputs y), and than continue with the received results. The requirement is to survive a restart of any WF component without loosing the state of the long running operation. Does TaskFlow already have a way to do it? Or ongoing ideas, considerations? If yes let's review. Else let's brainstorm together. I agree, that has major side-effects for state-persistence, resumption (and to me is a incorrect abstraction to provide) and general workflow execution control, I'd be very careful with this But these requirement comes from customers' use cases: wait_5_day - lifecycle management workflow, long running external system - Murano requirements, user input - workflow for operation automations with control gate checks, provisions which require 'approval' steps, etc. DZ ___ OpenStack-dev mailing list OpenStack-dev@lists.openstack.org http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev ___ OpenStack-dev mailing list OpenStack-dev@lists.openstack.org http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev -- Sincerely yours Stanislav (Stan) Lagun Senior Developer Mirantis 35b/3, Vorontsovskaya St. Moscow, Russia Skype: stanlagun www.mirantis.com sla...@mirantis.com ___ OpenStack-dev mailing list OpenStack-dev@lists.openstack.org http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev ___ OpenStack-dev mailing list OpenStack-dev@lists.openstack.org http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
Re: [openstack-dev] [Mistral][TaskFlow] Long running actions
Will advise soon, out sick with not so fun case of poison oak, will reply next week (hopefully) when I'm less incapacitated... Sent from my really tiny device... On Mar 21, 2014, at 3:24 AM, Renat Akhmerov rakhme...@mirantis.commailto:rakhme...@mirantis.com wrote: Valid concerns. It would be great to get Joshua involved in this discussion. If it’s possible to do in TaskFlow he could advise on how exactly. Renat Akhmerov @ Mirantis Inc. On 21 Mar 2014, at 16:23, Stan Lagun sla...@mirantis.commailto:sla...@mirantis.com wrote: Don't forget HA issues. Mistral can be restarted at any moment and need to be able to proceed from the place it was interrupted on another instance. In theory it can be addressed by TaskFlow but I'm not sure it can be done without complete redesign of it On Fri, Mar 21, 2014 at 8:33 AM, W Chan m4d.co...@gmail.commailto:m4d.co...@gmail.com wrote: Can the long running task be handled by putting the target task in the workflow in a persisted state until either an event triggers it or timeout occurs? An event (human approval or trigger from an external system) sent to the transport will rejuvenate the task. The timeout is configurable by the end user up to a certain time limit set by the mistral admin. Based on the TaskFlow examples, it seems like the engine instance managing the workflow will be in memory until the flow is completed. Unless there's other options to schedule tasks in TaskFlow, if we have too many of these workflows with long running tasks, seems like it'll become a memory issue for mistral... On Thu, Mar 20, 2014 at 3:07 PM, Dmitri Zimine d...@stackstorm.commailto:d...@stackstorm.com wrote: For the 'asynchronous manner' discussion see http://tinyurl.com/n3v9lt8; I'm still not sure why u would want to make is_sync/is_async a primitive concept in a workflow system, shouldn't this be only up to the entity running the workflow to decide? Why is a task allowed to be sync/async, that has major side-effects for state-persistence, resumption (and to me is a incorrect abstraction to provide) and general workflow execution control, I'd be very careful with this (which is why I am hesitant to add it without much much more discussion). Let's remove the confusion caused by async. All tasks [may] run async from the engine standpoint, agreed. Long running tasks - that's it. Examples: wait_5_days, run_hadoop_job, take_human_input. The Task doesn't do the job: it delegates to an external system. The flow execution needs to wait (5 days passed, hadoob job finished with data x, user inputs y), and than continue with the received results. The requirement is to survive a restart of any WF component without loosing the state of the long running operation. Does TaskFlow already have a way to do it? Or ongoing ideas, considerations? If yes let's review. Else let's brainstorm together. I agree, that has major side-effects for state-persistence, resumption (and to me is a incorrect abstraction to provide) and general workflow execution control, I'd be very careful with this But these requirement comes from customers' use cases: wait_5_day - lifecycle management workflow, long running external system - Murano requirements, user input - workflow for operation automations with control gate checks, provisions which require 'approval' steps, etc. DZ ___ OpenStack-dev mailing list OpenStack-dev@lists.openstack.orgmailto:OpenStack-dev@lists.openstack.org http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev ___ OpenStack-dev mailing list OpenStack-dev@lists.openstack.orgmailto:OpenStack-dev@lists.openstack.org http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev -- Sincerely yours Stanislav (Stan) Lagun Senior Developer Mirantis 35b/3, Vorontsovskaya St. Moscow, Russia Skype: stanlagun www.mirantis.comhttp://www.mirantis.com/ sla...@mirantis.commailto:sla...@mirantis.com ___ OpenStack-dev mailing list OpenStack-dev@lists.openstack.orgmailto:OpenStack-dev@lists.openstack.org http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev ___ OpenStack-dev mailing list OpenStack-dev@lists.openstack.orgmailto:OpenStack-dev@lists.openstack.org http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev ___ OpenStack-dev mailing list OpenStack-dev@lists.openstack.org http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
[openstack-dev] [Mistral][TaskFlow] Long running actions
For the 'asynchronous manner' discussion see http://tinyurl.com/n3v9lt8; I'm still not sure why u would want to make is_sync/is_async a primitive concept in a workflow system, shouldn't this be only up to the entity running the workflow to decide? Why is a task allowed to be sync/async, that has major side-effects for state-persistence, resumption (and to me is a incorrect abstraction to provide) and general workflow execution control, I'd be very careful with this (which is why I am hesitant to add it without much much more discussion). Let's remove the confusion caused by async. All tasks [may] run async from the engine standpoint, agreed. Long running tasks - that's it. Examples: wait_5_days, run_hadoop_job, take_human_input. The Task doesn't do the job: it delegates to an external system. The flow execution needs to wait (5 days passed, hadoob job finished with data x, user inputs y), and than continue with the received results. The requirement is to survive a restart of any WF component without loosing the state of the long running operation. Does TaskFlow already have a way to do it? Or ongoing ideas, considerations? If yes let's review. Else let's brainstorm together. I agree, that has major side-effects for state-persistence, resumption (and to me is a incorrect abstraction to provide) and general workflow execution control, I'd be very careful with this But these requirement comes from customers' use cases: wait_5_day - lifecycle management workflow, long running external system - Murano requirements, user input - workflow for operation automations with control gate checks, provisions which require 'approval' steps, etc. DZ ___ OpenStack-dev mailing list OpenStack-dev@lists.openstack.org http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
Re: [openstack-dev] [Mistral][TaskFlow] Long running actions
Can the long running task be handled by putting the target task in the workflow in a persisted state until either an event triggers it or timeout occurs? An event (human approval or trigger from an external system) sent to the transport will rejuvenate the task. The timeout is configurable by the end user up to a certain time limit set by the mistral admin. Based on the TaskFlow examples, it seems like the engine instance managing the workflow will be in memory until the flow is completed. Unless there's other options to schedule tasks in TaskFlow, if we have too many of these workflows with long running tasks, seems like it'll become a memory issue for mistral... On Thu, Mar 20, 2014 at 3:07 PM, Dmitri Zimine d...@stackstorm.com wrote: For the 'asynchronous manner' discussion see http://tinyurl.com/n3v9lt8; I'm still not sure why u would want to make is_sync/is_async a primitive concept in a workflow system, shouldn't this be only up to the entity running the workflow to decide? Why is a task allowed to be sync/async, that has major side-effects for state-persistence, resumption (and to me is a incorrect abstraction to provide) and general workflow execution control, I'd be very careful with this (which is why I am hesitant to add it without much much more discussion). Let's remove the confusion caused by async. All tasks [may] run async from the engine standpoint, agreed. Long running tasks - that's it. Examples: wait_5_days, run_hadoop_job, take_human_input. The Task doesn't do the job: it delegates to an external system. The flow execution needs to wait (5 days passed, hadoob job finished with data x, user inputs y), and than continue with the received results. The requirement is to survive a restart of any WF component without loosing the state of the long running operation. Does TaskFlow already have a way to do it? Or ongoing ideas, considerations? If yes let's review. Else let's brainstorm together. I agree, that has major side-effects for state-persistence, resumption (and to me is a incorrect abstraction to provide) and general workflow execution control, I'd be very careful with this But these requirement comes from customers' use cases: wait_5_day - lifecycle management workflow, long running external system - Murano requirements, user input - workflow for operation automations with control gate checks, provisions which require 'approval' steps, etc. DZ ___ OpenStack-dev mailing list OpenStack-dev@lists.openstack.org http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev ___ OpenStack-dev mailing list OpenStack-dev@lists.openstack.org http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev