Re: [openstack-dev] [Mistral][TaskFlow] Long running actions

2014-04-01 Thread Dmitri Zimine

On Apr 1, 2014, at 3:43 AM, Renat Akhmerov rakhme...@mirantis.com wrote:
 On 25 Mar 2014, at 01:51, Joshua Harlow harlo...@yahoo-inc.com wrote:
 
 The first execution model I would call the local execution model. This model 
 involves forming tasks and flows and then executing them inside an 
 application; that application is running for the duration of the workflow 
 (although if it crashes it can re-establish the tasks and flows that it was 
 doing and attempt to resume them). This could also be what openstack 
 projects would call the 'conductor' approach, where nova, ironic and trove 
 have a conductor which manages these long-running actions (the conductor is 
 alive/running throughout the duration of these workflows, although it may be 
 restarted while running). The restarting + resuming part is something that 
 openstack currently hasn't handled very gracefully, typically requiring some 
 type of cleanup at restart (or by operations); with taskflow using this 
 model, the resumption part makes it possible to resume from the last saved 
 state (this connects into the persistence model that taskflow uses, the 
 state transitions, how execution itself occurs...). 
 
 The second execution model is an extension of the first, whereby there is 
 still a type of 'conductor' that is managing the life-time of the workflow, 
 but instead of locally executing tasks in the conductor itself, tasks are 
 now executed on remote workers (see http://tinyurl.com/lf3yqe4). The engine 
 is still 'alive' for the life-time of the execution, although the work that 
 it is doing is relatively minimal (since it's not actually executing any 
 task code, but proxying those requests to other workers). The engine, while 
 running, does the conducting of the remote workers (saving persistence 
 details, doing state transitions, getting results, sending requests to 
 workers…).
 
 These two execution models are special cases of what you call “lazy 
 execution model” (or passive, as we call it). To illustrate this idea we 
 can take a look at the first sequence diagram at [0]; we will basically see 
 the following interaction:
 
 1) engine --(task)-- queue --(task)-- worker
 2) execute task
 3) worker --(result)-- queue --(result)-- engine
 
 This is how the TaskFlow worker-based model works.
 
 If we loosen the requirement in 3) and assume that not only the worker can 
 send a task result back to the engine, we get our passive model. Instead of 
 the worker it can be anything else (some external system) that knows how to 
 make this call. The particular way is not too important: it can be a direct 
 message or it can be hidden behind an API method. In Mistral it’s now a 
 REST API method; however, we’re about to decouple the engine from the REST 
 API so that the engine is a standalone process and listens to a queue. So 
 the worker-based model is basically the same, with the only strict 
 requirement being that only the worker sends a result back.
 
 In order to implement the local execution model on top of the “lazy 
 execution model” we just need to abstract the transport (queue) so that we 
 can use an in-process transport. That’s it. It’s what Mistral has already 
 implemented. Again, we see that the “lazy execution model” is more universal.
 
 IMO this “lazy execution model” should be the main execution model that 
 TaskFlow supports; the others can easily be implemented on top of it. But 
 the opposite assertion is wrong. IMO this is the most important obstacle in 
 all our discussions, and the reason why we don’t always understand each 
 other well enough. I know it may be a lot of work to shift a paradigm in 
 the TaskFlow team, but if we did that we would get enough freedom for using 
 TaskFlow in lots of cases.
 
 Let me know what you think. I might have missed something.

DZ: Interesting idea! So the other models of execution are based on the lazy 
execution model? TaskFlow implements this, we can use it, and for other 
clients more convenient higher-level execution models are provided? 
Interesting. Makes sense.
@Joshua? @Kirill? Others? 

 
 === HA ===
 
 So this is an interesting question, and to me it is strongly connected to 
 how your engines are executing (and the persistence and state transitions 
 that they go through while running). Without persistence of state and 
 transitions there is no good way (a bad way can of course be created, by 
 just redoing all the work, but that's not always feasible or the best 
 option) to accomplish resuming in a sane manner, and there is also imho no 
 way to accomplish any type of automated HA of workflows. 
 
 Sure, no questions here.
 
 Let me describe:
 
 When you save the states of a workflow and any intermediate results of a 
 workflow to some database (for example), and the engine (see above models) 
 which is being used (for example the conductor type from above) lives in an 
 application that may be prone to crashes (or just being powered off due to 
 software upgrades...), then since taskflow's key primitives were made to 
 allow for

Re: [openstack-dev] [Mistral][TaskFlow] Long running actions

2014-04-01 Thread Joshua Harlow
Inline responses.

From: Renat Akhmerov rakhme...@mirantis.com
Reply-To: OpenStack Development Mailing List (not for usage questions) 
openstack-dev@lists.openstack.org
Date: Tuesday, April 1, 2014 at 3:43 AM
To: OpenStack Development Mailing List (not for usage questions) 
openstack-dev@lists.openstack.org
Subject: Re: [openstack-dev] [Mistral][TaskFlow] Long running actions


On 25 Mar 2014, at 01:51, Joshua Harlow harlo...@yahoo-inc.com wrote:

The first execution model I would call the local execution model. This model 
involves forming tasks and flows and then executing them inside an 
application; that application is running for the duration of the workflow 
(although if it crashes it can re-establish the tasks and flows that it was 
doing and attempt to resume them). This could also be what openstack 
projects would call the 'conductor' approach, where nova, ironic and trove 
have a conductor which manages these long-running actions (the conductor is 
alive/running throughout the duration of these workflows, although it may be 
restarted while running). The restarting + resuming part is something that 
openstack currently hasn't handled very gracefully, typically requiring some 
type of cleanup at restart (or by operations); with taskflow using this 
model, the resumption part makes it possible to resume from the last saved 
state (this connects into the persistence model that taskflow uses, the 
state transitions, how execution itself occurs...).

The second execution model is an extension of the first, whereby there is 
still a type of 'conductor' that is managing the life-time of the workflow, 
but instead of locally executing tasks in the conductor itself, tasks are 
now executed on remote workers (see http://tinyurl.com/lf3yqe4). The engine 
is still 'alive' for the life-time of the execution, although the work that 
it is doing is relatively minimal (since it's not actually executing any 
task code, but proxying those requests to other workers). The engine, while 
running, does the conducting of the remote workers (saving persistence 
details, doing state transitions, getting results, sending requests to 
workers…).

These two execution models are special cases of what you call “lazy 
execution model” (or passive, as we call it). To illustrate this idea we can 
take a look at the first sequence diagram at [0]; we will basically see the 
following interaction:

1) engine --(task)-- queue --(task)-- worker
2) execute task
3) worker --(result)-- queue --(result)-- engine

This is how the TaskFlow worker-based model works.

If we loosen the requirement in 3) and assume that not only the worker can 
send a task result back to the engine, we get our passive model. Instead of 
the worker it can be anything else (some external system) that knows how to 
make this call. The particular way is not too important: it can be a direct 
message or it can be hidden behind an API method. In Mistral it’s now a REST 
API method; however, we’re about to decouple the engine from the REST API so 
that the engine is a standalone process and listens to a queue. So the 
worker-based model is basically the same, with the only strict requirement 
being that only the worker sends a result back.

In order to implement the local execution model on top of the “lazy 
execution model” we just need to abstract the transport (queue) so that we 
can use an in-process transport. That’s it. It’s what Mistral has already 
implemented. Again, we see that the “lazy execution model” is more universal.

IMO this “lazy execution model” should be the main execution model that 
TaskFlow supports; the others can easily be implemented on top of it. But 
the opposite assertion is wrong. IMO this is the most important obstacle in 
all our discussions, and the reason why we don’t always understand each 
other well enough. I know it may be a lot of work to shift a paradigm in the 
TaskFlow team, but if we did that we would get enough freedom for using 
TaskFlow in lots of cases.

Everything is a lot of work ;) That’s just how it goes. I think if we work 
through it then it will all be fine in the end.

I think some of this is being resolved/discussed @ http://tinyurl.com/k3s2gmy


Let me know what you think. I might have missed something.

=== HA ===

So this is an interesting question, and to me it is strongly connected to 
how your engines are executing (and the persistence and state transitions 
that they go through while running). Without persistence of state and 
transitions there is no good way (a bad way can of course be created, by 
just redoing all the work, but that's not always feasible or the best 
option) to accomplish resuming in a sane manner, and there is also imho no 
way to accomplish any type of automated HA of workflows.

Sure, no questions here.

Let me describe:

When you save the states of a workflow and any intermediate results of a 
workflow to some database

Re: [openstack-dev] [Mistral][TaskFlow] Long running actions

2014-04-01 Thread Joshua Harlow
More inline.

From: Dmitri Zimine d...@stackstorm.com
Reply-To: OpenStack Development Mailing List (not for usage questions) 
openstack-dev@lists.openstack.org
Date: Tuesday, April 1, 2014 at 2:59 PM
To: OpenStack Development Mailing List (not for usage questions) 
openstack-dev@lists.openstack.org
Subject: Re: [openstack-dev] [Mistral][TaskFlow] Long running actions


On Apr 1, 2014, at 3:43 AM, Renat Akhmerov rakhme...@mirantis.com wrote:
On 25 Mar 2014, at 01:51, Joshua Harlow harlo...@yahoo-inc.com wrote:

The first execution model I would call the local execution model. This model 
involves forming tasks and flows and then executing them inside an 
application; that application is running for the duration of the workflow 
(although if it crashes it can re-establish the tasks and flows that it was 
doing and attempt to resume them). This could also be what openstack 
projects would call the 'conductor' approach, where nova, ironic and trove 
have a conductor which manages these long-running actions (the conductor is 
alive/running throughout the duration of these workflows, although it may be 
restarted while running). The restarting + resuming part is something that 
openstack currently hasn't handled very gracefully, typically requiring some 
type of cleanup at restart (or by operations); with taskflow using this 
model, the resumption part makes it possible to resume from the last saved 
state (this connects into the persistence model that taskflow uses, the 
state transitions, how execution itself occurs...).

The second execution model is an extension of the first, whereby there is 
still a type of 'conductor' that is managing the life-time of the workflow, 
but instead of locally executing tasks in the conductor itself, tasks are 
now executed on remote workers (see http://tinyurl.com/lf3yqe4). The engine 
is still 'alive' for the life-time of the execution, although the work that 
it is doing is relatively minimal (since it's not actually executing any 
task code, but proxying those requests to other workers). The engine, while 
running, does the conducting of the remote workers (saving persistence 
details, doing state transitions, getting results, sending requests to 
workers…).

These two execution models are special cases of what you call “lazy 
execution model” (or passive, as we call it). To illustrate this idea we can 
take a look at the first sequence diagram at [0]; we will basically see the 
following interaction:

1) engine --(task)-- queue --(task)-- worker
2) execute task
3) worker --(result)-- queue --(result)-- engine

This is how the TaskFlow worker-based model works.

If we loosen the requirement in 3) and assume that not only the worker can 
send a task result back to the engine, we get our passive model. Instead of 
the worker it can be anything else (some external system) that knows how to 
make this call. The particular way is not too important: it can be a direct 
message or it can be hidden behind an API method. In Mistral it’s now a REST 
API method; however, we’re about to decouple the engine from the REST API so 
that the engine is a standalone process and listens to a queue. So the 
worker-based model is basically the same, with the only strict requirement 
being that only the worker sends a result back.

In order to implement the local execution model on top of the “lazy 
execution model” we just need to abstract the transport (queue) so that we 
can use an in-process transport. That’s it. It’s what Mistral has already 
implemented. Again, we see that the “lazy execution model” is more universal.

IMO this “lazy execution model” should be the main execution model that 
TaskFlow supports; the others can easily be implemented on top of it. But 
the opposite assertion is wrong. IMO this is the most important obstacle in 
all our discussions, and the reason why we don’t always understand each 
other well enough. I know it may be a lot of work to shift a paradigm in the 
TaskFlow team, but if we did that we would get enough freedom for using 
TaskFlow in lots of cases.

Let me know what you think. I might have missed something.

DZ: Interesting idea! So the other models of execution are based on the lazy 
execution model? TaskFlow implements this, we can use it, and for other 
clients more convenient higher-level execution models are provided? 
Interesting. Makes sense.
@Joshua? @Kirill? Others?

I think this is likely possible, which is similar to what's in 
http://tinyurl.com/k3s2gmy; engine types can be built from each other (and 
if we wanted to alter the structure that exists in taskflow, then sure). But 
see that message for more of my concerns around exposing that engine API to 
library users (I think it could have its usage in mistral to expose this, 
but I'm not sure it's useful elsewhere, and once it's a public engine API, 
it's public for a very long time).



=== HA ===

So

Re: [openstack-dev] [Mistral][TaskFlow] Long running actions

2014-04-01 Thread Renat Akhmerov
On 02 Apr 2014, at 06:00, Joshua Harlow harlo...@yahoo-inc.com wrote:

 More inline.
 
 From: Dmitri Zimine d...@stackstorm.com
 Reply-To: OpenStack Development Mailing List (not for usage questions) 
 openstack-dev@lists.openstack.org
 Date: Tuesday, April 1, 2014 at 2:59 PM
 To: OpenStack Development Mailing List (not for usage questions) 
 openstack-dev@lists.openstack.org
 Subject: Re: [openstack-dev] [Mistral][TaskFlow] Long running actions
 
 
 On Apr 1, 2014, at 3:43 AM, Renat Akhmerov rakhme...@mirantis.com wrote:
 On 25 Mar 2014, at 01:51, Joshua Harlow harlo...@yahoo-inc.com wrote:
 
 The first execution model I would call the local execution model. This 
 model involves forming tasks and flows and then executing them inside an 
 application; that application is running for the duration of the workflow 
 (although if it crashes it can re-establish the tasks and flows that it was 
 doing and attempt to resume them). This could also be what openstack 
 projects would call the 'conductor' approach, where nova, ironic and trove 
 have a conductor which manages these long-running actions (the conductor 
 is alive/running throughout the duration of these workflows, although it 
 may be restarted while running). The restarting + resuming part is 
 something that openstack currently hasn't handled very gracefully, 
 typically requiring some type of cleanup at restart (or by operations); 
 with taskflow using this model, the resumption part makes it possible to 
 resume from the last saved state (this connects into the persistence model 
 that taskflow uses, the state transitions, how execution itself occurs...). 
 
 The second execution model is an extension of the first, whereby there is 
 still a type of 'conductor' that is managing the life-time of the 
 workflow, but instead of locally executing tasks in the conductor itself, 
 tasks are now executed on remote workers (see http://tinyurl.com/lf3yqe4). 
 The engine is still 'alive' for the life-time of the execution, although 
 the work that it is doing is relatively minimal (since it's not actually 
 executing any task code, but proxying those requests to other workers). 
 The engine, while running, does the conducting of the remote workers 
 (saving persistence details, doing state transitions, getting results, 
 sending requests to workers…).
 
 These two execution models are special cases of what you call “lazy 
 execution model” (or passive, as we call it). To illustrate this idea we 
 can take a look at the first sequence diagram at [0]; we will basically see 
 the following interaction:
 
 1) engine --(task)-- queue --(task)-- worker
 2) execute task
 3) worker --(result)-- queue --(result)-- engine
 
 This is how the TaskFlow worker-based model works.
 
 If we loosen the requirement in 3) and assume that not only the worker can 
 send a task result back to the engine, we get our passive model. Instead 
 of the worker it can be anything else (some external system) that knows 
 how to make this call. The particular way is not too important: it can be 
 a direct message or it can be hidden behind an API method. In Mistral it’s 
 now a REST API method; however, we’re about to decouple the engine from 
 the REST API so that the engine is a standalone process and listens to a 
 queue. So the worker-based model is basically the same, with the only 
 strict requirement being that only the worker sends a result back.
 
 In order to implement the local execution model on top of the “lazy 
 execution model” we just need to abstract the transport (queue) so that we 
 can use an in-process transport. That’s it. It’s what Mistral has already 
 implemented. Again, we see that the “lazy execution model” is more 
 universal.
 
 IMO this “lazy execution model” should be the main execution model that 
 TaskFlow supports; the others can easily be implemented on top of it. But 
 the opposite assertion is wrong. IMO this is the most important obstacle 
 in all our discussions, and the reason why we don’t always understand each 
 other well enough. I know it may be a lot of work to shift a paradigm in 
 the TaskFlow team, but if we did that we would get enough freedom for 
 using TaskFlow in lots of cases.
 
 Let me know what you think. I might have missed something.
 
 DZ: Interesting idea! So the other models of execution are based on the 
 lazy execution model? TaskFlow implements this, we can use it, and for 
 other clients more convenient higher-level execution models are provided? 
 Interesting. Makes sense.
 @Joshua? @Kirill? Others? 
 
 
 I think this is likely possible, which is similar to what's in 
 http://tinyurl.com/k3s2gmy; engine types can be built from each other (and 
 if we wanted to alter the structure that exists in taskflow, then sure). 
 But see that message for more of my concerns around exposing that engine 
 API to library users (I think it could have its usage in mistral to expose 
 this, but I'm not sure it's useful elsewhere, and once it's a public 
 engine API, it's public for a very long time).

What are we

Re: [openstack-dev] [Mistral][TaskFlow] Long running actions

2014-04-01 Thread Joshua Harlow
Get er' done!

Haha, do u guys want to jump into #openstack-state-management tomorrow? Or at 
our Thursday meeting we can discuss more how this might work and such.

That'd be cool.

Sent from my really tiny device...

On Apr 1, 2014, at 9:37 PM, Renat Akhmerov rakhme...@mirantis.com wrote:

On 02 Apr 2014, at 06:00, Joshua Harlow harlo...@yahoo-inc.com wrote:

More inline.

From: Dmitri Zimine d...@stackstorm.com
Reply-To: OpenStack Development Mailing List (not for usage questions) 
openstack-dev@lists.openstack.org
Date: Tuesday, April 1, 2014 at 2:59 PM
To: OpenStack Development Mailing List (not for usage questions) 
openstack-dev@lists.openstack.org
Subject: Re: [openstack-dev] [Mistral][TaskFlow] Long running actions


On Apr 1, 2014, at 3:43 AM, Renat Akhmerov rakhme...@mirantis.com wrote:
On 25 Mar 2014, at 01:51, Joshua Harlow harlo...@yahoo-inc.com wrote:

The first execution model I would call the local execution model. This model 
involves forming tasks and flows and then executing them inside an 
application; that application is running for the duration of the workflow 
(although if it crashes it can re-establish the tasks and flows that it was 
doing and attempt to resume them). This could also be what openstack 
projects would call the 'conductor' approach, where nova, ironic and trove 
have a conductor which manages these long-running actions (the conductor is 
alive/running throughout the duration of these workflows, although it may be 
restarted while running). The restarting + resuming part is something that 
openstack currently hasn't handled very gracefully, typically requiring some 
type of cleanup at restart (or by operations); with taskflow using this 
model, the resumption part makes it possible to resume from the last saved 
state (this connects into the persistence model that taskflow uses, the 
state transitions, how execution itself occurs...).

The second execution model is an extension of the first, whereby there is 
still a type of 'conductor' that is managing the life-time of the workflow, 
but instead of locally executing tasks in the conductor itself, tasks are 
now executed on remote workers (see http://tinyurl.com/lf3yqe4). The engine 
is still 'alive' for the life-time of the execution, although the work that 
it is doing is relatively minimal (since it's not actually executing any 
task code, but proxying those requests to other workers). The engine, while 
running, does the conducting of the remote workers (saving persistence 
details, doing state transitions, getting results, sending requests to 
workers…).

These two execution models are special cases of what you call “lazy 
execution model” (or passive, as we call it). To illustrate this idea we can 
take a look at the first sequence diagram at [0]; we will basically see the 
following interaction:

1) engine --(task)-- queue --(task)-- worker
2) execute task
3) worker --(result)-- queue --(result)-- engine

This is how the TaskFlow worker-based model works.

If we loosen the requirement in 3) and assume that not only the worker can 
send a task result back to the engine, we get our passive model. Instead of 
the worker it can be anything else (some external system) that knows how to 
make this call. The particular way is not too important: it can be a direct 
message or it can be hidden behind an API method. In Mistral it’s now a REST 
API method; however, we’re about to decouple the engine from the REST API so 
that the engine is a standalone process and listens to a queue. So the 
worker-based model is basically the same, with the only strict requirement 
being that only the worker sends a result back.

In order to implement the local execution model on top of the “lazy 
execution model” we just need to abstract the transport (queue) so that we 
can use an in-process transport. That’s it. It’s what Mistral has already 
implemented. Again, we see that the “lazy execution model” is more universal.

IMO this “lazy execution model” should be the main execution model that 
TaskFlow supports; the others can easily be implemented on top of it. But 
the opposite assertion is wrong. IMO this is the most important obstacle in 
all our discussions, and the reason why we don’t always understand each 
other well enough. I know it may be a lot of work to shift a paradigm in the 
TaskFlow team, but if we did that we would get enough freedom for using 
TaskFlow in lots of cases.

Let me know what you think. I might have missed something.

DZ: Interesting idea! So the other models of execution are based on the lazy 
execution model? TaskFlow implements this, we can use it, and for other 
clients more convenient higher-level execution models are provided? 
Interesting. Makes sense.
@Joshua? @Kirill? Others?

I think this is likely possible, which is similar to what's in 
http

Re: [openstack-dev] [Mistral][TaskFlow] Long running actions

2014-03-26 Thread Dmitri Zimine
 connected to how 
 your engines are executing (and the persistence and state transitions that 
 they go through while running). Without persistence of state and 
 transitions there is no good way (a bad way can of course be created, by 
 just redoing all the work, but that's not always feasible or the best 
 option) to accomplish resuming in a sane manner, and there is also imho no 
 way to accomplish any type of automated HA of workflows. Since taskflow was 
 conceived to manage the states and transitions of tasks and flows, it 
 gains the ability to do this resuming, but it also gains the ability to 
 automatically provide execution HA to its users.
 
 Let me describe:
 
 When you save the states of a workflow and any intermediate results of a 
 workflow to some database (for example), and the engine (see above models) 
 which is being used (for example the conductor type from above) lives in 
 an application that may be prone to crashes (or just being powered off due 
 to software upgrades...), then since taskflow's key primitives were made 
 to allow for resuming when a crash occurs, it is relatively simple to 
 allow another application (also running a conductor) to resume whatever 
 that prior application was doing when it crashed. Now most users of 
 taskflow don't want to have to do this resumption manually (although they 
 can if they want), so it would be expected that the other running 
 instances of that application would automatically 'know' how to 
 'take-over' the work of the failed application. This is where the concept 
 of taskflow's 'jobboard' (http://tinyurl.com/klg358j) comes into play, 
 where a jobboard can be backed by something like zookeeper (which provides 
 notifications of lock loss/release to others automatically). The jobboard 
 is the place where the other applications would be looking to 'take-over' 
 the failed application's work (by using zookeeper 'atomic' primitives 
 designed for this type of usage), and they would also release the work 
 back for others to 'take-over' when their own zookeeper connection is lost 
 (zookeeper handles this natively).
 
 --
 
 Now as for how much of mistral would change from the above, I don't know, 
 but that's why it's a POC.
 
 -Josh
 
 From: Joshua Harlow harlo...@yahoo-inc.com
 Date: Friday, March 21, 2014 at 1:14 PM
 To: OpenStack Development Mailing List (not for usage questions) 
 openstack-dev@lists.openstack.org
 Cc: OpenStack Development Mailing List (not for usage questions) 
 openstack-dev@lists.openstack.org
 Subject: Re: [openstack-dev] [Mistral][TaskFlow] Long running actions
 
 Will advise soon, out sick with not so fun case of poison oak, will reply 
 next week (hopefully) when I'm less incapacitated...
 
 Sent from my really tiny device...
 
 On Mar 21, 2014, at 3:24 AM, Renat Akhmerov rakhme...@mirantis.com wrote:
 
 Valid concerns. It would be great to get Joshua involved in this 
 discussion. If it’s possible to do in TaskFlow he could advise on how 
 exactly.
 
 Renat Akhmerov
 @ Mirantis Inc.
 
 
 
 On 21 Mar 2014, at 16:23, Stan Lagun sla...@mirantis.com wrote:
 
 Don't forget HA issues. Mistral can be restarted at any moment and needs 
 to be able to proceed, on another instance, from the place where it was 
 interrupted. In theory it can be addressed by TaskFlow, but I'm not sure 
 it can be done without a complete redesign of it.
 
 
 On Fri, Mar 21, 2014 at 8:33 AM, W Chan m4d.co...@gmail.com wrote:
 Can the long running task be handled by putting the target task in the 
 workflow in a persisted state until either an event triggers it or 
 timeout occurs?  An event (human approval or trigger from an external 
 system) sent to the transport will rejuvenate the task.  The timeout is 
 configurable by the end user up to a certain time limit set by the 
 mistral admin.  
 
 Based on the TaskFlow examples, it seems like the engine instance 
 managing the workflow will be in memory until the flow is completed.  
 Unless there are other options to schedule tasks in TaskFlow, if we have 
 too many of these workflows with long running tasks, it seems like it'll 
 become a memory issue for mistral...
 
 
 On Thu, Mar 20, 2014 at 3:07 PM, Dmitri Zimine d...@stackstorm.com 
 wrote:
 
 For the 'asynchronous manner' discussion see 
 http://tinyurl.com/n3v9lt8; I'm still not sure why u would want to make 
 is_sync/is_async a primitive concept in a workflow system; shouldn't 
 this be only up to the entity running the workflow to decide? Why is a 
 task allowed to be sync/async? That has major side-effects for 
 state-persistence, resumption (and to me is an incorrect abstraction to 
 provide) and general workflow execution control; I'd be very careful 
 with this (which is why I am hesitant to add it without much, much more 
 discussion).
 
 
 Let's remove the confusion caused by "async". All tasks [may] run async 
 from the engine standpoint, agreed. 
 
 Long running tasks - that's it.
 
 Examples: wait_5_days, run_hadoop_job

Re: [openstack-dev] [Mistral][TaskFlow] Long running actions

2014-03-26 Thread Joshua Harlow
Cool, sounds great.

I think all 3 models can co-exist (since each serves a good purpose); it'd 
be interesting to see how the POC 'engine' can become a taskflow 'engine' 
(aka the lazy_engine).

As to scalability I agree the lazy_engine would be nicer, but how much more 
scalable it is, is a tough one to quantify (the openstack systems that have 
active conductors, aka model #2, seem to scale pretty well).

Of course there are some interesting questions laziness brings up; it'd be 
interesting to see how the POC addressed them.

Some questions I can think of (currently), maybe u can address them in the 
other thread (which is fine too).

What does the watchdog do? Is it activated periodically to 'reap' jobs that 
have timed out (or have gone past some time limit)? How does the watchdog 
know that it is reaping jobs that are not actively being worked on (a 
timeout likely isn't sufficient for jobs that just take a very long time)? 
Is there a connection into zookeeper (or some similar system) to do this 
kind of 'liveness' verification instead? What does the watchdog do when 
reaping tasks? (Revert them, retry them, other..?)

I'm not quite sure how taskflow would use mistral as a client for this 
watchdog, since the watchdog process is pretty key to the lazy_engine's 
execution model, and it seems like it would be a bad idea to split that 
logic from the actual execution model itself (seeing that the watchdog is 
involved in the execution process, and really isn't external to it). To me 
the concept of the lazy_engine is similar to the case where an engine 
'crashes' while running; in a way the lazy_engine 'crashes on purpose' after 
asking a set of workers to do some action (and hands over the resumption of 
'itself' to this watchdog process). The watchdog then watches over the 
workers, and on response from some worker the watchdog resumes the engine 
and then lets the engine 'crash on purpose' again (and repeat). So the 
watchdog and lazy_engine execution models seem to be pretty interconnected.
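
As a rough sketch of that interplay (db and engine below are hypothetical
placeholders for the persistence store and a resumable lazy engine, not real
APIs):

    import time

    def watchdog_loop(db, engine, poll_interval=30):
        # Periodically scan persisted executions that are parked in a
        # WAITING state; the lazy engine itself is not running meanwhile.
        while True:
            for execution in db.find(state='WAITING'):
                if execution.has_new_results():
                    # A worker or external system delivered a result: wake
                    # the engine just long enough to advance the flow, then
                    # let it 'crash on purpose' again.
                    engine.resume(execution)
                elif execution.deadline_passed():
                    # Reaping policy is an open question: revert, retry,
                    # or mark failed.
                    engine.fail_task(execution, reason='timeout')
            time.sleep(poll_interval)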

-Josh

From: Dmitri Zimine d...@stackstorm.com
Reply-To: OpenStack Development Mailing List (not for usage questions) 
openstack-dev@lists.openstack.org
Date: Wednesday, March 26, 2014 at 2:12 PM
To: OpenStack Development Mailing List (not for usage questions) 
openstack-dev@lists.openstack.org
Subject: Re: [openstack-dev] [Mistral][TaskFlow] Long running actions

=== Long-running delegate [1] actions ===

Yes, the third model of lazy / passive engine is needed.

Obviously workflows contain a mix of different tasks, so this 3rd model 
should handle both normal tasks (run on a worker and return) and long 
running delegates. The active mechanism which is alive during the process, 
currently the TaskFlow engine, may be moved from the TaskFlow library to a 
client (Mistral) which implements the watchdog. This may require a 
lower-level API to TaskFlow.

The benefit of model 2 is 'ease of use' for some clients (create tasks, 
define a flow, instantiate an engine, engine.run(), that's it!). But I agree 
that model 2 - the worker-based TaskFlow engine - won't scale to WFaaS 
requirements, even though the engine is not doing much.

The Mistral POC implements a passive, lazy workflow model: a service moving 
the states of multiple parallel executions. I'll detail how Mistral handles 
long running tasks in a separate thread (maybe here 
http://tinyurl.com/n3v9lt8) and we can look at how TaskFlow may change to fit.

DZ

PS. Thanks for clarifications on the target use cases for the execution models!


[1] Calling them 'delegate actions' to distinguish between long running 
computations on workers, and actions that delegate to 3rd-party systems (a 
hadoop job, a human input gateway, etc).


On Mar 24, 2014, at 11:51 AM, Joshua Harlow harlo...@yahoo-inc.com wrote:

So getting back to this thread.

I'd like to split it up into a few sections to address the HA and 
long-running-actions cases, which I believe are two separate (but connected) 
questions.

=== Long-running actions ===

First, let me describe a little bit about what I believe are the execution 
models that taskflow currently targets (but is not limited to just targeting in 
general).

The first execution model I would call the local execution model. This model 
involves forming tasks and flows and then executing them inside an 
application; that application is running for the duration of the workflow 
(although if it crashes it can re-establish the tasks and flows that it was 
doing and attempt to resume them). This could also be what openstack 
projects would call the 'conductor' approach, where nova, ironic and trove 
have a conductor which manages these long-running actions (the conductor is 
alive/running throughout the duration of these workflows, although it may be 
restarted while running). The restarting + resuming part is something

Re: [openstack-dev] [Mistral][TaskFlow] Long running actions

2014-03-25 Thread Adam Young

On 03/21/2014 12:33 AM, W Chan wrote:
Can the long running task be handled by putting the target task in the 
workflow in a persisted state until either an event triggers it or a 
timeout occurs?  An event (human approval or a trigger from an external 
system) sent to the transport will rejuvenate the task.  The timeout 
is configurable by the end user, up to a certain time limit set by the 
mistral admin.


Based on the TaskFlow examples, it seems like the engine instance 
managing the workflow will be in memory until the flow is completed. 
Unless there are other options to schedule tasks in TaskFlow, if we 
have too many of these workflows with long running tasks, it seems like 
it'll become a memory issue for mistral...


Look into the Trusts capability of Keystone for Authorization support 
on long running tasks.
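
For reference, a rough sketch of what that could look like with 
python-keystoneclient's v3 trusts API (treat the calls and all ids/roles 
below as assumptions for illustration, not a verified recipe): the workflow 
service obtains a trust when the workflow is submitted and re-authenticates 
with it much later, when the long running task finally resumes.

    from keystoneclient.v3 import client

    ks = client.Client(username='alice', password='secret',
                       project_name='demo',
                       auth_url='http://keystone:5000/v3')

    # Delegate alice's 'member' role on the project to the mistral service
    # user so the service can act on her behalf long after her token expires.
    # All ids below are made-up placeholders.
    trust = ks.trusts.create(trustor_user='alice-user-id',
                             trustee_user='mistral-service-user-id',
                             project='demo-project-id',
                             role_names=['member'],
                             impersonation=True)
    # Days later, the service authenticates with trust_id=trust.id to get a
    # token scoped to the trust and finish the long running work.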





On Thu, Mar 20, 2014 at 3:07 PM, Dmitri Zimine d...@stackstorm.com wrote:




For the 'asynchronous manner' discussion see
http://tinyurl.com/n3v9lt8; I'm still not sure why u would want
to make is_sync/is_async a primitive concept in a workflow
system; shouldn't this be only up to the entity running the
workflow to decide? Why is a task allowed to be sync/async? That
has major side-effects for state-persistence, resumption (and to
me is an incorrect abstraction to provide) and general workflow
execution control; I'd be very careful with this (which is why I
am hesitant to add it without much, much more discussion).


Let's remove the confusion caused by "async". All tasks [may] run
async from the engine standpoint, agreed.

Long running tasks - that's it.

Examples: wait_5_days, run_hadoop_job, take_human_input.
The Task doesn't do the job: it delegates to an external system.
The flow execution needs to wait (5 days passed, hadoop job
finished with data x, user inputs y), and then continue with the
received results.

The requirement is to survive a restart of any WF component
without losing the state of the long running operation.

Does TaskFlow already have a way to do it? Or ongoing ideas,
considerations? If yes let's review. Else let's brainstorm together.

I agree,

that has major side-effects for state-persistence, resumption
(and to me is an incorrect abstraction to provide) and general
workflow execution control, I'd be very careful with this

But these requirements come from customers' use cases:
wait_5_day - lifecycle management workflows; long running external
systems - Murano requirements; user input - workflows for operations
automation with control gate checks, provisioning which requires
'approval' steps, etc.

DZ










Re: [openstack-dev] [Mistral][TaskFlow] Long running actions

2014-03-24 Thread Joshua Harlow
 of the failed 
application. This is where the concept of taskflow's 'jobboard' 
(http://tinyurl.com/klg358j) comes into play, where a jobboard can be backed 
by something like zookeeper (which provides notifications of lock 
loss/release to others automatically). The jobboard is the place where the 
other applications would be looking to 'take-over' the failed application's 
work (by using zookeeper 'atomic' primitives designed for this type of 
usage), and they would also release the work back for others to 'take-over' 
when their own zookeeper connection is lost (zookeeper handles this natively).
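
To make the 'take-over' mechanics concrete, here is a minimal sketch using 
the kazoo ZooKeeper client directly (an illustration of the locking pattern 
only, not TaskFlow's actual jobboard API; the job path and resume_flow 
helper are made up):

    from kazoo.client import KazooClient

    def resume_flow(job_path):
        # Hypothetical helper: reload the persisted flow state for this job
        # and continue it from the last saved checkpoint.
        print('resuming work for', job_path)

    zk = KazooClient(hosts='127.0.0.1:2181')
    zk.start()

    JOB_PATH = '/jobboard/job-0001'  # hypothetical job entry

    lock = zk.Lock(JOB_PATH + '/lock', identifier='conductor-A')
    if lock.acquire(blocking=False):
        try:
            resume_flow(JOB_PATH)
        finally:
            lock.release()
    # else: another conductor owns the job; because the lock node is
    # ephemeral, losing that conductor's zookeeper session releases it
    # automatically and someone else can take the job over.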

--

Now as for how much of mistral would change from the above, I don't know, 
but that's why it's a POC.

-Josh

From: Joshua Harlow harlo...@yahoo-inc.com
Date: Friday, March 21, 2014 at 1:14 PM
To: OpenStack Development Mailing List (not for usage questions) 
openstack-dev@lists.openstack.org
Cc: OpenStack Development Mailing List (not for usage questions) 
openstack-dev@lists.openstack.org
Subject: Re: [openstack-dev] [Mistral][TaskFlow] Long running actions

Will advise soon, out sick with not so fun case of poison oak, will reply next 
week (hopefully) when I'm less incapacitated...

Sent from my really tiny device...

On Mar 21, 2014, at 3:24 AM, Renat Akhmerov rakhme...@mirantis.com wrote:

Valid concerns. It would be great to get Joshua involved in this discussion. If 
it’s possible to do in TaskFlow he could advise on how exactly.

Renat Akhmerov
@ Mirantis Inc.



On 21 Mar 2014, at 16:23, Stan Lagun sla...@mirantis.com wrote:

Don't forget HA issues. Mistral can be restarted at any moment and needs to 
be able to proceed, on another instance, from the place where it was 
interrupted. In theory it can be addressed by TaskFlow, but I'm not sure it 
can be done without a complete redesign of it.


On Fri, Mar 21, 2014 at 8:33 AM, W Chan m4d.co...@gmail.com wrote:
Can the long running task be handled by putting the target task in the 
workflow in a persisted state until either an event triggers it or a timeout 
occurs?  An event (human approval or a trigger from an external system) sent 
to the transport will rejuvenate the task.  The timeout is configurable by 
the end user, up to a certain time limit set by the mistral admin.

Based on the TaskFlow examples, it seems like the engine instance managing 
the workflow will be in memory until the flow is completed.  Unless there 
are other options to schedule tasks in TaskFlow, if we have too many of 
these workflows with long running tasks, it seems like it'll become a memory 
issue for mistral...


On Thu, Mar 20, 2014 at 3:07 PM, Dmitri Zimine d...@stackstorm.com wrote:

For the 'asynchronous manner' discussion see http://tinyurl.com/n3v9lt8; I'm 
still not sure why u would want to make is_sync/is_async a primitive concept 
in a workflow system; shouldn't this be only up to the entity running the 
workflow to decide? Why is a task allowed to be sync/async? That has major 
side-effects for state-persistence, resumption (and to me is an incorrect 
abstraction to provide) and general workflow execution control; I'd be very 
careful with this (which is why I am hesitant to add it without much, much 
more discussion).

Let's remove the confusion caused by "async". All tasks [may] run async from 
the engine standpoint, agreed.

Long running tasks - that's it.

Examples: wait_5_days, run_hadoop_job, take_human_input.
The Task doesn't do the job: it delegates to an external system. The flow 
execution needs to wait (5 days passed, hadoop job finished with data x, 
user inputs y), and then continue with the received results.

The requirement is to survive a restart of any WF component without losing 
the state of the long running operation.

Does TaskFlow already have a way to do it? Or ongoing ideas, considerations? If 
yes let's review. Else let's brainstorm together.

I agree,
that has major side-effects for state-persistence, resumption (and to me is 
an incorrect abstraction to provide) and general workflow execution control, 
I'd be very careful with this
But these requirements come from customers' use cases: wait_5_day - 
lifecycle management workflows; long running external systems - Murano 
requirements; user input - workflows for operations automation with control 
gate checks, provisioning which requires 'approval' steps, etc.

DZ






Re: [openstack-dev] [Mistral][TaskFlow] Long running actions

2014-03-21 Thread Stan Lagun
Don't forget HA issues. Mistral can be restarted at any moment and needs to
be able to proceed, on another instance, from the place where it was
interrupted. In theory it can be addressed by TaskFlow, but I'm not sure it
can be done without a complete redesign of it.


On Fri, Mar 21, 2014 at 8:33 AM, W Chan m4d.co...@gmail.com wrote:

 Can the long running task be handled by putting the target task in the
 workflow in a persisted state until either an event triggers it or a
 timeout occurs?  An event (human approval or a trigger from an external
 system) sent to the transport will rejuvenate the task.  The timeout is
 configurable by the end user, up to a certain time limit set by the
 mistral admin.

 Based on the TaskFlow examples, it seems like the engine instance managing
 the workflow will be in memory until the flow is completed.  Unless there
 are other options to schedule tasks in TaskFlow, if we have too many of
 these workflows with long running tasks, it seems like it'll become a
 memory issue for mistral...


 On Thu, Mar 20, 2014 at 3:07 PM, Dmitri Zimine d...@stackstorm.com wrote:


 For the 'asynchronous manner' discussion see http://tinyurl.com/n3v9lt8;
 I'm still not sure why u would want to make is_sync/is_async a primitive
 concept in a workflow system; shouldn't this be only up to the entity
 running the workflow to decide? Why is a task allowed to be sync/async?
 That has major side-effects for state-persistence, resumption (and to me
 is an incorrect abstraction to provide) and general workflow execution
 control; I'd be very careful with this (which is why I am hesitant to add
 it without much, much more discussion).


 Let's remove the confusion caused by "async". All tasks [may] run async
 from the engine standpoint, agreed.

 Long running tasks - that's it.

 Examples: wait_5_days, run_hadoop_job, take_human_input.
 The Task doesn't do the job: it delegates to an external system. The flow
 execution needs to wait (5 days passed, hadoop job finished with data x,
 user inputs y), and then continue with the received results.

 The requirement is to survive a restart of any WF component without
 losing the state of the long running operation.

 Does TaskFlow already have a way to do it? Or ongoing ideas,
 considerations? If yes let's review. Else let's brainstorm together.

 I agree,

 that has major side-effects for state-persistence, resumption (and to me
 is an incorrect abstraction to provide) and general workflow execution
 control, I'd be very careful with this

 But these requirements come from customers' use cases: wait_5_day -
 lifecycle management workflows; long running external systems - Murano
 requirements; user input - workflows for operations automation with
 control gate checks, provisioning which requires 'approval' steps, etc.

 DZ









-- 
Sincerely yours
Stanislav (Stan) Lagun
Senior Developer
Mirantis
35b/3, Vorontsovskaya St.
Moscow, Russia
Skype: stanlagun
www.mirantis.com
sla...@mirantis.com


Re: [openstack-dev] [Mistral][TaskFlow] Long running actions

2014-03-21 Thread Renat Akhmerov
Valid concerns. It would be great to get Joshua involved in this discussion. If 
it’s possible to do in TaskFlow he could advise on how exactly.

Renat Akhmerov
@ Mirantis Inc.



On 21 Mar 2014, at 16:23, Stan Lagun sla...@mirantis.com wrote:

 Don't forget HA issues. Mistral can be restarted at any moment and needs to 
 be able to proceed, on another instance, from the place where it was 
 interrupted. In theory it can be addressed by TaskFlow, but I'm not sure it 
 can be done without a complete redesign of it.
 
 
 On Fri, Mar 21, 2014 at 8:33 AM, W Chan m4d.co...@gmail.com wrote:
 Can the long running task be handled by putting the target task in the 
 workflow in a persisted state until either an event triggers it or a 
 timeout occurs?  An event (human approval or a trigger from an external 
 system) sent to the transport will rejuvenate the task.  The timeout is 
 configurable by the end user, up to a certain time limit set by the 
 mistral admin.  
 
 Based on the TaskFlow examples, it seems like the engine instance managing 
 the workflow will be in memory until the flow is completed.  Unless there 
 are other options to schedule tasks in TaskFlow, if we have too many of 
 these workflows with long running tasks, it seems like it'll become a 
 memory issue for mistral...
 
 
 On Thu, Mar 20, 2014 at 3:07 PM, Dmitri Zimine d...@stackstorm.com wrote:
 
 For the 'asynchronous manner' discussion see http://tinyurl.com/n3v9lt8; 
 I'm still not sure why u would want to make is_sync/is_async a primitive 
 concept in a workflow system; shouldn't this be only up to the entity 
 running the workflow to decide? Why is a task allowed to be sync/async? 
 That has major side-effects for state-persistence, resumption (and to me 
 is an incorrect abstraction to provide) and general workflow execution 
 control; I'd be very careful with this (which is why I am hesitant to add 
 it without much, much more discussion).
 
 
 Let's remove the confusion caused by "async". All tasks [may] run async 
 from the engine standpoint, agreed. 

 Long running tasks - that's it.

 Examples: wait_5_days, run_hadoop_job, take_human_input. 
 The Task doesn't do the job: it delegates to an external system. The flow 
 execution needs to wait (5 days passed, hadoop job finished with data x, 
 user inputs y), and then continue with the received results.
 
 The requirement is to survive a restart of any WF component without losing 
 the state of the long running operation.
 
 Does TaskFlow already have a way to do it? Or ongoing ideas, considerations? 
 If yes let's review. Else let's brainstorm together. 
 
 I agree,
 that has major side-effects for state-persistence, resumption (and to me is 
 an incorrect abstraction to provide) and general workflow execution 
 control, I'd be very careful with this
 
 But these requirements come from customers' use cases: wait_5_day - 
 lifecycle management workflows; long running external systems - Murano 
 requirements; user input - workflows for operations automation with control 
 gate checks, provisioning which requires 'approval' steps, etc. 
 
 DZ 
 
 
 
 
 
 
 
 
 
 -- 
 Sincerely yours
 Stanislav (Stan) Lagun
 Senior Developer
 Mirantis
 35b/3, Vorontsovskaya St.
 Moscow, Russia
 Skype: stanlagun
 www.mirantis.com
 sla...@mirantis.com



Re: [openstack-dev] [Mistral][TaskFlow] Long running actions

2014-03-21 Thread Joshua Harlow
Will advise soon, out sick with not so fun case of poison oak, will reply next 
week (hopefully) when I'm less incapacitated...

Sent from my really tiny device...

On Mar 21, 2014, at 3:24 AM, Renat Akhmerov rakhme...@mirantis.com wrote:

Valid concerns. It would be great to get Joshua involved in this discussion. If 
it’s possible to do in TaskFlow he could advise on how exactly.

Renat Akhmerov
@ Mirantis Inc.



On 21 Mar 2014, at 16:23, Stan Lagun sla...@mirantis.com wrote:

Don't forget HA issues. Mistral can be restarted at any moment and needs to 
be able to proceed, on another instance, from the place where it was 
interrupted. In theory it can be addressed by TaskFlow, but I'm not sure it 
can be done without a complete redesign of it.


On Fri, Mar 21, 2014 at 8:33 AM, W Chan m4d.co...@gmail.com wrote:
Can the long running task be handled by putting the target task in the 
workflow in a persisted state until either an event triggers it or a timeout 
occurs?  An event (human approval or a trigger from an external system) sent 
to the transport will rejuvenate the task.  The timeout is configurable by 
the end user, up to a certain time limit set by the mistral admin.

Based on the TaskFlow examples, it seems like the engine instance managing 
the workflow will be in memory until the flow is completed.  Unless there 
are other options to schedule tasks in TaskFlow, if we have too many of 
these workflows with long running tasks, it seems like it'll become a memory 
issue for mistral...


On Thu, Mar 20, 2014 at 3:07 PM, Dmitri Zimine d...@stackstorm.com wrote:

For the 'asynchronous manner' discussion see http://tinyurl.com/n3v9lt8; I'm 
still not sure why u would want to make is_sync/is_async a primitive concept 
in a workflow system; shouldn't this be only up to the entity running the 
workflow to decide? Why is a task allowed to be sync/async? That has major 
side-effects for state-persistence, resumption (and to me is an incorrect 
abstraction to provide) and general workflow execution control; I'd be very 
careful with this (which is why I am hesitant to add it without much, much 
more discussion).

Let's remove the confusion caused by "async". All tasks [may] run async from 
the engine standpoint, agreed.

Long running tasks - that's it.

Examples: wait_5_days, run_hadoop_job, take_human_input.
The Task doesn't do the job: it delegates to an external system. The flow 
execution needs to wait (5 days passed, hadoop job finished with data x, 
user inputs y), and then continue with the received results.

The requirement is to survive a restart of any WF component without losing 
the state of the long running operation.

Does TaskFlow already have a way to do it? Or ongoing ideas, considerations? If 
yes let's review. Else let's brainstorm together.

I agree,
that has major side-effects for state-persistence, resumption (and to me is 
an incorrect abstraction to provide) and general workflow execution control, 
I'd be very careful with this
But these requirements come from customers' use cases: wait_5_day - 
lifecycle management workflows; long running external systems - Murano 
requirements; user input - workflows for operations automation with control 
gate checks, provisioning which requires 'approval' steps, etc.

DZ









--
Sincerely yours
Stanislav (Stan) Lagun
Senior Developer
Mirantis
35b/3, Vorontsovskaya St.
Moscow, Russia
Skype: stanlagun
www.mirantis.com
sla...@mirantis.com



[openstack-dev] [Mistral][TaskFlow] Long running actions

2014-03-20 Thread Dmitri Zimine

 For the 'asynchronous manner' discussion see http://tinyurl.com/n3v9lt8; 
 I'm still not sure why u would want to make is_sync/is_async a primitive 
 concept in a workflow system; shouldn't this be only up to the entity 
 running the workflow to decide? Why is a task allowed to be sync/async? 
 That has major side-effects for state-persistence, resumption (and to me 
 is an incorrect abstraction to provide) and general workflow execution 
 control; I'd be very careful with this (which is why I am hesitant to add 
 it without much, much more discussion).


Let's remove the confusion caused by "async". All tasks [may] run async from 
the engine standpoint, agreed. 

Long running tasks - that's it.

Examples: wait_5_days, run_hadoop_job, take_human_input. 
The Task doesn't do the job: it delegates to an external system. The flow 
execution needs to wait (5 days passed, hadoop job finished with data x, 
user inputs y), and then continue with the received results.

The requirement is to survive a restart of any WF component without losing 
the state of the long running operation.

Does TaskFlow already have a way to do it? Or ongoing ideas, considerations? If 
yes let's review. Else let's brainstorm together. 

I agree,
 that has major side-effects for state-persistence, resumption (and to me is 
 an incorrect abstraction to provide) and general workflow execution 
 control, I'd be very careful with this

But these requirements come from customers' use cases: wait_5_day - 
lifecycle management workflows; long running external systems - Murano 
requirements; user input - workflows for operations automation with control 
gate checks, provisioning which requires 'approval' steps, etc. 

DZ 



Re: [openstack-dev] [Mistral][TaskFlow] Long running actions

2014-03-20 Thread W Chan
Can the long running task be handled by putting the target task in the
workflow in a persisted state until either an event triggers it or a timeout
occurs?  An event (human approval or a trigger from an external system) sent
to the transport will rejuvenate the task.  The timeout is configurable by
the end user, up to a certain time limit set by the mistral admin.

Based on the TaskFlow examples, it seems like the engine instance managing
the workflow will be in memory until the flow is completed.  Unless there
are other options to schedule tasks in TaskFlow, if we have too many of
these workflows with long running tasks, it seems like it'll become a memory
issue for mistral...
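
A minimal sketch of that persisted-wait pattern (db and engine are
hypothetical placeholders, not Mistral or TaskFlow APIs); the waiting task
lives as a row in the database rather than in engine memory, so any number
of such waits costs no RAM:

    import time

    ADMIN_MAX_TIMEOUT = 7 * 24 * 3600  # cap set by the mistral admin

    def start_delegate_task(db, task_id, user_timeout):
        # Persist the task as WAITING and return; nothing stays in memory.
        deadline = time.time() + min(user_timeout, ADMIN_MAX_TIMEOUT)
        db.save(task_id, state='WAITING', deadline=deadline)

    def on_external_event(db, engine, task_id, result):
        # Human approval or an external-system callback rejuvenates the task.
        db.save(task_id, state='SUCCESS', result=result)
        engine.continue_from(task_id)

    def sweep_timeouts(db, engine):
        # Run periodically; expired waits re-enter the flow as timeouts.
        for task in db.find(state='WAITING', deadline_before=time.time()):
            db.save(task.id, state='TIMEOUT')
            engine.continue_from(task.id)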


On Thu, Mar 20, 2014 at 3:07 PM, Dmitri Zimine d...@stackstorm.com wrote:


 For the 'asynchronous manner' discussion see http://tinyurl.com/n3v9lt8;
 I'm still not sure why u would want to make is_sync/is_async a primitive
 concept in a workflow system; shouldn't this be only up to the entity
 running the workflow to decide? Why is a task allowed to be sync/async?
 That has major side-effects for state-persistence, resumption (and to me
 is an incorrect abstraction to provide) and general workflow execution
 control; I'd be very careful with this (which is why I am hesitant to add
 it without much, much more discussion).


 Let's remove the confusion caused by "async". All tasks [may] run async
 from the engine standpoint, agreed.

 Long running tasks - that's it.

 Examples: wait_5_days, run_hadoop_job, take_human_input.
 The Task doesn't do the job: it delegates to an external system. The flow
 execution needs to wait (5 days passed, hadoop job finished with data x,
 user inputs y), and then continue with the received results.

 The requirement is to survive a restart of any WF component without
 losing the state of the long running operation.

 Does TaskFlow already have a way to do it? Or ongoing ideas,
 considerations? If yes let's review. Else let's brainstorm together.

 I agree,

 that has major side-effects for state-persistence, resumption (and to me
 is an incorrect abstraction to provide) and general workflow execution
 control, I'd be very careful with this

 But these requirements come from customers' use cases: wait_5_day -
 lifecycle management workflows; long running external systems - Murano
 requirements; user input - workflows for operations automation with
 control gate checks, provisioning which requires 'approval' steps, etc.

 DZ



