Re: [openstack-dev] [Mistral] How Mistral handling long running delegate tasks

2014-04-04 Thread Renat Akhmerov

On 04 Apr 2014, at 07:33, Kirill Izotov enyk...@stackstorm.com wrote:

 Then, we can make task executor interface public and allow clients to
 provide their own task executors. It will be possible then for Mistral
 to implement its own task executor, or several, and share the
 executors between all the engine instances.
 I'm afraid that if we start to tear apart the TaskFlow engine, it would
 quickly become a mess to support. Besides, the amount of things left to
 integrate after we throw out the engine might be so small that it would make
 the whole process of integration merely nominal, and we are back to square one.
 Anyway, task execution is the part that bothers me least; both the graph action
 and the engine itself are where the pain will be.

Would love to see something additional (boxes/arrows) explaining this approach. 
Sorry, I'm having a hard time following the idea.

 That is part of our public API, it is stable and good enough. Basically,
 I don't think this API needs any major change.
 
 But whatever should and will be done about it, I daresay all that work
 can be done without affecting the API more than I described above.
 
 I completely agree that we should not change the public API of the sync 
 engine, especially the one in the helpers. What we need is, on the contrary, a 
 low-level construct that would do the number of things I stated previously, 
 but would be a part of the public API of TaskFlow so we can be sure it would 
 work exactly the same way it worked yesterday.

I'm 99.9% sure we'll have to change the API, because everything we've been 
discussing so far makes me think this is a key point going implicitly through 
all our discussions: without having a public method like "task_done" we won't 
build a truly passive/async execution model. And it doesn't matter whether it 
uses futures, callbacks or whatever else inside.

And again, just to repeat: if we are able to deal with all the challenges that 
the passive/async execution model exposes, then other models can be built 
trivially on top of it.

@Ivan,

Thanks for joining the conversation. Looks like we really need your active 
participation, since you're the one who knows all the TF internals and concepts 
very well. As for what you wrote about futures and callbacks, it would be 
helpful to see some illustration of your idea.

Renat Akhmerov
@ Mirantis Inc.


Re: [openstack-dev] [Mistral] How Mistral handling long running delegate tasks

2014-04-03 Thread Renat Akhmerov

On 03 Apr 2014, at 12:49, Kirill Izotov enyk...@stackstorm.com wrote:

 Actually, the idea to make the concept more expandable is exactly my point =) 
 Mistral's Workbook is roughly the same as a TaskFlow Flow and has nothing to 
 do with the TaskFlow LogBook. The only major difference in the case of Workbook 
 is that we have to store it after it was created using the API or DSL. I very 
 much like the idea of inheriting from the basic Flow and adding serialization 
 and additional metadata on top of it, but the concept of retrying within a 
 Workbook\Flow is not exactly what we have in mind. Instead of repeating the 
 Flow, we are repeating the Task (which has the same differences between Mistral 
 and TaskFlow as the Workbook has), and to use the trick you have proposed we 
 would have to translate the Task into a Flow. How much sense does it make with 
 respect to expandability and generality?

Just one clarification. Workbook is a wider abstraction. Along with the 
workflow itself, it includes action definitions and triggers (even though 
triggers are still controversial, even within the team itself). Potentially it 
may include something else. So having a workbook in TaskFlow with strictly 
defined behaviour is not a good decision.

 Ya, I think u are hitting this issue: if a worker acks a message saying it's 
 'working on it' and then the worker dies, this is where a message queue ack 
 model won't really help; the engine will believe that the worker 
 is working on it (since hey, the worker acked it) but the engine will really 
 have no idea that the worker is actually dead. Of course timeouts can be 
 used to get around this, but imho that's really inelegant when something 
 like zookeeper (or similar) can be used to be notified of this worker death 
 immediately; which means no timeouts are needed at all, and this separate 
 process or engine can be notified of this death (and resolve it 
 immediately). I don't believe this kind of 'liveness' connection is possible 
 with rabbit (we also must think of other things besides rabbit, like qpid 
 and such) but I might be wrong; I thought once a worker acks a message then 
 whoever receives that ack will believe the work has started and will be 
 finished someday in the future (aka, no connection that the work is actually 
 in progress). Anyways, we can discuss this more since I think it's a common 
 confusion point :-)
 As far as I understand, what Renat is proposing here is not to acknowledge 
 the message until it has been successfully executed. I'm not sure how exactly 
 RabbitMQ will react in that case, though you are right in the idea that we 
 must investigate the other solutions more carefully.

Joshua, I think your understanding of how it works is not exactly correct (or 
my understanding actually :)). Let’s talk in IRC, we’ll explain it.

I’d suggest we talk in IRC on Thursday at 8.30 pm (Friday 10.30 in our 
timezone).

Renat Akhmerov
@ Mirantis Inc.

___
OpenStack-dev mailing list
OpenStack-dev@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [Mistral] How Mistral handling long running delegate tasks

2014-04-03 Thread Ivan Melnikov

I'm trying to catch up with this rather long and interesting discussion,
sorry for the somewhat late reply.

I can see these aspects of 'lazy model' support in TaskFlow:
- how tasks are executed and reverted
- how flows are run
- how the engine works internally

Let me address those aspects separately.

== Executing and reverting tasks ==

I think that should be done via a different interface than running a flow
(or scheduling it to run), as it is a completely different thing. In
current TaskFlow this interface is called the task executor:
https://github.com/openstack/taskflow/blob/master/taskflow/engines/action_engine/executor.py#L57

That is actually how our WorkerBasedEngine was implemented: it's the
same engine with special task executor that schedules tasks on worker
instead of running task code locally.

Task executors are not aware of flows by design; all they do is
execute and revert tasks. That means that task executors can be
easily shared between engines if that's wanted.

The current TaskExecutorBase interface uses futures (PEP 3148-like). When I
proposed it, futures looked like a good tool for the task at hand (see
e.g. the async task etherpad
https://etherpad.openstack.org/p/async-taskflow-tasks)

Now it may be time to reconsider that: having one future object per
running task may become a scalability issue. It may be worth using
callbacks instead. It should not be too hard to refactor the current engine
for that. Also, as TaskExecutorBase is an internal API, there should not
be any compatibility issues.

Then, we can make task executor interface public and allow clients to
provide their own task executors. It will be possible then for Mistral
to implement its own task executor, or several, and share the
executors between all the engine instances.

You can call it a plan;)
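
To make the plan a bit more concrete, here is a rough sketch of what a
Mistral-provided task executor could look like; the method names, message
layout and publisher below are illustrative assumptions, not the exact
TaskExecutorBase signatures:

from concurrent import futures

class QueueTaskExecutor(object):
    """Hypothetical executor that ships tasks to a worker queue.

    Instead of running task code locally, it publishes the task to a
    message queue and returns a future that is resolved later, when the
    worker (or an external callback) reports the result back.
    """

    def __init__(self, publisher):
        self._publisher = publisher   # e.g. an AMQP client (assumed)
        self._pending = {}            # task_uuid -> future

    def execute_task(self, task, task_uuid, arguments):
        fut = futures.Future()
        self._pending[task_uuid] = fut
        self._publisher.publish({'task': task.name,
                                 'uuid': task_uuid,
                                 'args': arguments})
        return fut                    # the engine waits on this future

    def task_done(self, task_uuid, result):
        # Called when the worker or a 3rd-party webhook reports back; with a
        # callback-based interface this would invoke a callback instead of
        # resolving a future.
        self._pending.pop(task_uuid).set_result(result)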

== Running flows ==

To run a flow, a TaskFlow client uses the engine interface; also, there are
a few helper functions provided for convenience:

http://docs.openstack.org/developer/taskflow/engines.html#module-taskflow.engines.base
http://docs.openstack.org/developer/taskflow/engines.html#creating-engines

That is part of our public API, it is stable and good enough. Basically,
I don't think this API needs any major change.

Maybe it's worth adding a function or method to schedule a flow run
without actually waiting for flow completion (at least, it was on my top
secret TODO list for quite a long time).
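
For reference, this is roughly how the existing helpers are used today; the
non-blocking variant at the end is hypothetical, not an existing TaskFlow
function:

from taskflow import engines, task
from taskflow.patterns import linear_flow

class SayHello(task.Task):
    def execute(self):
        print("hello from a task")

flow = linear_flow.Flow("demo").add(SayHello())

# Today: blocks until the whole flow completes.
engines.run(flow)

# The idea mentioned above: schedule the flow and return immediately.
# engine = engines.run_async(flow)   # hypothetical; does not exist yet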

== Engine internals ==

Each engine eats resources, like the thread it runs on; using these
resources to run only one flow is somewhat wasteful. Some work is
already planned to address this situation (see e.g.
https://blueprints.launchpad.net/taskflow/+spec/share-engine-thread).
Also, it might be a good idea to implement a different 'type' of engine to
support the 'lazy' model, as Joshua suggests.

But whatever should and will be done about it, I daresay all that work
can be done without affecting the API more than I described above.

-- 
WBR,
Ivan A. Melnikov

... tasks must flow ...


On 02.04.2014 01:51, Dmitri Zimine wrote:
 Even more responses inline :)
[...]



Re: [openstack-dev] [Mistral] How Mistral handling long running delegate tasks

2014-04-03 Thread Joshua Harlow
Thanks Ivan,

This does seem like a possible way forward also.

I'd be interested to see what/how callbacks would work vs. futures and
what extension point we could provide to Mistral for task execution (maybe
their task executor would complete by doing a call to some service, not
amqp for example?).

Maybe some example/POC code would help all :-)

-Josh



Re: [openstack-dev] [Mistral] How Mistral handling long running delegate tasks

2014-04-03 Thread Kirill Izotov
 Then, we can make task executor interface public and allow clients to
 provide their own task executors. It will be possible then for Mistral
 to implement its own task executor, or several, and share the
 executors between all the engine instances.

I'm afraid that if we start to tear apart the TaskFlow engine, it would quickly 
become a mess to support. Besides, the amount of things left to integrate after 
we throw out the engine might be so small that it would make the whole process 
of integration merely nominal, and we are back to square one. Anyway, task 
execution is the part that bothers me least; both the graph action and the 
engine itself are where the pain will be.

 That is part of our public API, it is stable and good enough. Basically,
 I don't think this API needs any major change.


 But whatever should and will be done about it, I daresay all that work
 can be done without affecting API more then I described above.



I completely agree that we should not change the public API of the sync engine, 
especially the one in the helpers. What we need is, on the contrary, a low-level 
construct that would do the number of things I stated previously, but would be a 
part of the public API of TaskFlow so we can be sure it would work exactly the 
same way it worked yesterday.

--  
Kirill Izotov




Re: [openstack-dev] [Mistral] How Mistral handling long running delegate tasks

2014-04-02 Thread Renat Akhmerov
On 02 Apr 2014, at 13:38, Kirill Izotov enyk...@stackstorm.com wrote:

 I agree that we probably need a new engine for that kind of change and, as 
 Renat already said in another thread, the lazy model seems to be more basic and 
 it would be easier to build a sync engine on top of that rather than the other 
 way around. Yes, it will entail a lot of changes in the engines that are 
 currently here, but it seems like the only way to get something that would fit 
 us both.
 
 Since it seems like we are reaching some kind of agreement here, we should 
 probably start shifting to the plane where we discuss the design of the new 
 engine, rather than the need for one. The idea has been spread over too many 
 places, so I'll try to gather it back. 
 
 Lazy engine should be async and atomic; it should not have its own state, 
 instead it should rely on some kind of global state (db or in-memory, 
 depending on the type of application). It should have at least two methods: run 
 and task_complete. The run method should calculate the first batch of tasks and 
 schedule them for execution (either put them in a queue or spawn the threads... 
 or send a pigeon, I insist =). Task_complete should mark a certain task as 
 completed and then schedule the next batch of tasks that became available due 
 to the resolution of this one.
 
 Then, on top of it you can build a sync engine by introducing Futures. You are 
 using async.run() to schedule the tasks by transforming them into Futures and 
 then starting a loop, checking Futures for completion and sending their 
 results to async.task_complete(), which would produce even more Futures to 
 check over. Just the same way TaskFlow does it right now.
 
 On the Mistral side, we are using the Lazy engine by patching async.run to the 
 API (or engine queue) and async.task_complete to the worker queue result 
 channel (and the API for long running tasks). We are still sharing the same 
 graph_analyzer, but instead of relying on a loop and Futures, we are handling 
 execution in a scalable and robust way.
 
 The reason I'm proposing to extract Futures from the async engine is that they 
 won't work if we have multiple engines that need to handle task results 
 concurrently (and without that there will be no scalability).

I agree with all the major points here. It sounds like a plan.
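
For concreteness, here is a bare-bones sketch of the lazy engine Kirill
describes above, plus a sync engine built on top of it with futures; the
analyzer, the scheduling callable and run_task are assumptions for
illustration, not actual Mistral or TaskFlow code:

from concurrent import futures

class LazyEngine(object):
    """Async, stateless core: two entry points, run() and task_complete()."""

    def __init__(self, analyzer, schedule):
        self._analyzer = analyzer   # knows the graph / control flow (shared)
        self._schedule = schedule   # e.g. publish to a queue, or run locally

    def run(self, execution):
        # Compute and schedule the first batch of tasks, then return.
        for task in self._analyzer.first_batch(execution):
            self._schedule(task)

    def task_complete(self, execution, task, result):
        # Record the result, then schedule whatever just became runnable.
        self._analyzer.mark_done(execution, task, result)
        for nxt in self._analyzer.next_batch(execution):
            self._schedule(nxt)

def run_sync(analyzer, execution, run_task):
    """Sync engine on top of the lazy one: futures plus a local loop."""
    pool = futures.ThreadPoolExecutor(max_workers=4)
    pending = {}

    def schedule(task):
        pending[pool.submit(run_task, task)] = task

    engine = LazyEngine(analyzer, schedule)
    engine.run(execution)
    while pending:
        done, _ = futures.wait(pending, return_when=futures.FIRST_COMPLETED)
        for fut in done:
            engine.task_complete(execution, pending.pop(fut), fut.result())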

 What I see bothering Joshua is that a worker that handles a long running task 
 may just die in the process, and there is no way for us to know for sure 
 whether it is still working or not. It is more important for the sync engine, 
 because without that it would just hang forever (though it is not clear whether 
 the sync engine needs long running tasks at all?). In the case of the sync 
 engine, the role of the watch-dog may be performed by the loop, while in the 
 case of Mistral it might be a separate process (though I bet there should be 
 something for resending the message in RabbitMQ itself). A common solution is 
 not necessary here.


As for not losing messages (resending them to the queue), that's why I 
mentioned message acknowledgement ([0]). Rabbit does it automatically if a 
message that has been pulled out of the queue still has to be acknowledged and 
the corresponding rabbit client connection dies before the client acknowledges 
the message (at which point there's a guarantee it's been processed). It works 
this way now in Mistral. A slight problem with the current Mistral 
implementation is that the engine works as follows when starting a workflow:

1.
Start DB Transaction
Make all smart logic and DB operations (updating persistent state of 
tasks and execution)
Commit DB Transaction
2. Submit task messages to MQ.

So there's a window between steps 1 and 2 where, if the system crashes, the DB 
state won't be consistent with the state of the MQ. From a worker perspective it 
looks like this problem doesn't exist (at least we thought it through and didn't 
find any synchronisation windows). Initially step 2 was a part of step 1 (a part 
of the DB TX) and that would solve the problem, but it's generally a bad pattern 
to have interprocess communication inside a DB TX, so we got rid of it.

So this seems to me to be the only problem with the current architecture; we 
left it as it was and were going to get back to it after the POC. I assume this 
might be the reason why Josh keeps talking about the 'jobboard' thing, but I 
just don't have enough details at this point to say whether that's what we need 
or not.

One solution I see here is just to consider this situation exceptional from the 
workflow execution model perspective and apply policies to tasks that are 
persisted in the DB but not submitted to MQ (what you guys discussed about 
retries and all that stuff..). So our watch-dog process could just resubmit 
tasks that have been hanging in the DB in the state IDLE longer than a 
configured period. Something like that.

[0] http://www.rabbitmq.com/tutorials/tutorial-two-python.html
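
For illustration, the acknowledge-only-after-processing pattern from [0],
sketched with kombu; the queue name and run_task() are made up for the example:

from kombu import Connection, Queue

task_queue = Queue('mistral_tasks', durable=True)

def run_task(body):
    print("running task: %s" % body)   # stand-in for real task execution

def handle_message(body, message):
    run_task(body)
    message.ack()   # ack only after success; if the worker dies before this
                    # line, the broker redelivers the message to another worker

with Connection('amqp://guest:guest@localhost//') as conn:
    with conn.Consumer(task_queue, callbacks=[handle_message]):
        while True:
            conn.drain_events()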

Renat Akhmerov
@ Mirantis Inc.


Re: [openstack-dev] [Mistral] How Mistral handling long running delegate tasks

2014-04-01 Thread Dmitri Zimine
Even more responses inline :)
On Mar 31, 2014, at 8:44 PM, Joshua Harlow harlo...@yahoo-inc.com wrote:

 More responses inline :)
 
 From: Dmitri Zimine d...@stackstorm.com
 Date: Monday, March 31, 2014 at 5:59 PM
 To: Joshua Harlow harlo...@yahoo-inc.com
 Cc: OpenStack Development Mailing List (not for usage questions) 
 openstack-dev@lists.openstack.org
 Subject: Re: [openstack-dev] [Mistral] How Mistral handling long running 
 delegate tasks
 
 Inline...
 On Mar 27, 2014, at 5:10 PM, Joshua Harlow harlo...@yahoo-inc.com wrote:
 
 Thanks for the description!
 
 The steps here seem very much like what a taskflow engine does (which is 
 good).
 
 To connect this to how I think it could work in taskflow.
 1. Someone creates tasks/flows describing the work to-be-done (converting a 
 DSL -> taskflow tasks/flows/retry[1] objects…)
 2. On execute(workflow) engine creates a new workflow execution, computes the 
 first batch of tasks, creates executor for those tasks (remote, local…) and 
 executes those tasks.
 3. Waits for response back from futures returned from executor.
 4. Receives futures responses (or receives new response DELAY, for example), 
 or exceptions…
 5. Continues sending out batches of tasks that can still be executing (aka 
 tasks that don't have a dependency on the output of delayed tasks).
 6. If any delayed tasks remain after repeating #2-5 as many times as it can, 
 the engine will shut itself down (see http://tinyurl.com/l3x3rrb).
Why would the engine treat long running tasks differently? The model Mistral 
tried out is: the engine sends the batch of tasks and goes to sleep; the 
'worker/executor' calls the engine back when the task(s) complete. Can it be 
applied?
 
 
 Not all ways of using taskflow I think want to go 'asleep', some active 
 conductors I would think actually stay 'awake', that’s why I was thinking 
 this would be a compromise, the usage of 'DELAY' (via exception or return) 
 would allow a task to say it is going to be finished sometime in the future 
 (btw started this @ https://review.openstack.org/#/c/84277/ --- WIP, likely 
 won't go into taskflow 0.2 but seems like a reasonable start on something 
 like this proposal). To me offering just the going asleep way means that all 
 users of taskflow need to have some type of entrypoint (rest, mq, other…) 
 that restarts the engine on result finishing. I'd rather not force that model 
 onto all users (since that seems inappropriate).
I understand and agree with not enforcing this model on all users. On the flip 
side, TaskFlow enforces the current model on Mistral :) It calls for supporting 
more models (to your earlier point of a 3rd mode of operation). The BP proposal 
is a step in that direction but still a compromise. How can we support both 
'active' and 'passive' models, sharing the same 'library' - of all things 
responsible for control flow and data flow?

Here is a strawman (DISCLAIMER: this is sharing ideas and brainstorming: I 
don't know enough of TaskFlow to suggest): 
1) break the run loop: add engine.run_async() method which instead of looping 
the tasks till the flow completes (http://tinyurl.com/nou2em9), only schedules 
the first batch and returns; 
https://github.com/openstack/taskflow/blob/master/taskflow/engines/action_engine/graph_action.py#L61-L68

2) introduce engine.task_complete(…) method that marks the current task 
complete, computes the next tasks ready for execution, schedules them and 
returns; loosely, doing this part: 
https://github.com/openstack/taskflow/blob/master/taskflow/engines/action_engine/graph_action.py#L75-L85

3) make executor signal engine back by calling engine.task_complete(…): 
executor.execute_task() submits the task to the worker queue and calls 
engine.task_complete(status, results, etc.) instead of waiting for task 
completion and returning the results. 

And it looks like it all can be done as a third lazy_engine, without messing 
with the current two engines? 
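
To illustrate point 3 of the strawman, an executor along those lines might look
as follows; the publisher, message layout and result handler are assumptions,
not actual Mistral/TaskFlow code:

class FireAndForgetExecutor(object):
    """Submits a task to the worker queue and returns immediately."""

    def __init__(self, publisher):
        self._publisher = publisher   # e.g. an AMQP client (assumed)

    def execute_task(self, execution_id, task):
        # No future, no blocking wait: just hand the task to a worker queue.
        self._publisher.publish({'execution': execution_id,
                                 'task': task.name,
                                 'input': task.input})

def on_worker_result(engine, message):
    # When the worker (or a webhook) reports back, resume the flow by calling
    # the engine's task_complete() as described in point 2.
    engine.task_complete(message['execution'], message['task'],
                         message['status'], message['results'])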


 
 
 On delay task finishing some API/webhook/other (the mechanism imho 
 shouldn't be tied to webhooks, at least not in taskflow, but should be left 
 up to the user of taskflow to decide how to accomplish this) will be/must 
 be responsible for resuming the engine and setting the result for the 
 previous delayed task.
 Oh no, a webhook is the way to expose it to a 3rd party system. From the 
 library standpoint it's just an API call. 
 
 One can do it even now by getting the appropriate Flow_details, instantiating 
 an engine (flow, flow_details) and running it to continue from where it left 
 off. Is that how you mean it? But I keep on dreaming of a passive version of 
 the TaskFlow engine which treats all tasks the same and exposes one extra 
 method - handle_tasks. 
 
 
 I was thinking that, although I'm unsure on the one extra method idea, can u 
 explain more :) What would handle_tasks do? Restart/resume the engine 
 (basically the work u stated 'getting the appropriate Flow_details, 
 instantiating and engine (flow, flow_details) and running

Re: [openstack-dev] [Mistral] How Mistral handling long running delegate tasks

2014-04-01 Thread Joshua Harlow
Cool, my thoughts added ;)

From: Dmitri Zimine d...@stackstorm.com
Date: Tuesday, April 1, 2014 at 2:51 PM
To: Joshua Harlow harlo...@yahoo-inc.com
Cc: OpenStack Development Mailing List (not for usage questions) 
openstack-dev@lists.openstack.org
Subject: Re: [openstack-dev] [Mistral] How Mistral handling long running 
delegate tasks

Even more responses inline :)
On Mar 31, 2014, at 8:44 PM, Joshua Harlow 
harlo...@yahoo-inc.com wrote:

More responses inline :)

From: Dmitri Zimine d...@stackstorm.com
Date: Monday, March 31, 2014 at 5:59 PM
To: Joshua Harlow harlo...@yahoo-inc.com
Cc: OpenStack Development Mailing List (not for usage questions) 
openstack-dev@lists.openstack.org
Subject: Re: [openstack-dev] [Mistral] How Mistral handling long running 
delegate tasks

Inline...
On Mar 27, 2014, at 5:10 PM, Joshua Harlow 
harlo...@yahoo-inc.com wrote:

Thanks for the description!

The steps here seem very much like what a taskflow engine does (which is good).

To connect this to how I think could work in taskflow.

  1.  Someone creates tasks/flows describing the work to-be-done (converting a 
DSL - taskflow tasks/flows/retry[1] objects…)
  2.  On execute(workflow) engine creates a new workflow execution, computes 
the first batch of tasks, creates executor for those tasks (remote, local…) and 
executes those tasks.
  3.  Waits for response back from 
futures (http://docs.python.org/dev/library/concurrent.futures.html) returned 
from executor.
  4.  Receives futures responses (or receives new response DELAY, for example), 
or exceptions…
  5.  Continues sending out batches of tasks that can still be executing 
(aka tasks that don't have a dependency on the output of delayed tasks).
  6.  If any delayed tasks after repeating #2-5 as many times as it can, the 
engine will shut itself down (see http://tinyurl.com/l3x3rrb).

Why would the engine treat long running tasks differently? The model Mistral 
tried out is: the engine sends the batch of tasks and goes to sleep; the 
'worker/executor' calls the engine back when the task(s) complete. Can it be 
applied?

Not all ways of using taskflow I think want to go 'asleep', some active 
conductors I would think actually stay 'awake', that’s why I was thinking this 
would be a compromise, the usage of 'DELAY' (via exception or return) would 
allow a task to say it is going to be finished sometime in the future (btw 
started this @ https://review.openstack.org/#/c/84277/ --- WIP, likely won't go 
into taskflow 0.2 but seems like a reasonable start on something like this 
proposal). To me offering just the going asleep way means that all users of 
taskflow need to have some type of entrypoint (rest, mq, other…) that restarts 
the engine on result finishing. I'd rather not force that model onto all users 
(since that seems inappropriate).
I understand and agree with not enforcing this model on all users. On the flip 
side, TaskFlow enforces the current model on Mistral :) It calls for supporting 
more models (to your earlier point of a 3rd mode of operation). The BP proposal 
is a step in that direction but still a compromise. How can we support both 
'active' and 'passive' models, sharing the same 'library' - of all things 
responsible for control flow and data flow?

To me this is what engines 'types' are for, to support different execution 
models while using as many of the same components as possible (making usage as 
seamless as can be).


Here is a strawman (DISCLAIMER: this is sharing ideas and brainstorming: I 
don't know enough of TaskFlow to suggest):
1) break the run loop: add engine.run_async() method which instead of looping 
the tasks till the flow completes (http://tinyurl.com/nou2em9), only schedules 
the first batch and returns;
https://github.com/openstack/taskflow/blob/master/taskflow/engines/action_engine/graph_action.py#L61-L68

Seems like a possibility, although I'd like to make this still a new engine type 
that does this (or maybe the existing engine types wrap this change to retain 
the usage that exists). This just seems like an extension of the DELAY thought, 
where at each iteration the engine would return DELAYed, then the caller (the 
one executing) would become aware that this has happened and decide how to 
handle it accordingly. Exposing 'run_async' still doesn't feel right; maybe 
it's just wording, but it seems like an 'async/lazy' engine still has a 
run() method that returns DELAYed more often than an engine that loops and 
never returns DELAYed (or only returns DELAYed if a task causes this to 
happen). To me they are the same run() method, just different implementations 
of run(). Exposing the semantics of running by naming it run_async exposes the 
underlying execution abstraction, which imho isn't

Re: [openstack-dev] [Mistral] How Mistral handling long running delegate tasks

2014-04-01 Thread Renat Akhmerov
On 02 Apr 2014, at 05:45, Joshua Harlow harlo...@yahoo-inc.com wrote:

 Possibly, although to me this is still exposing the internals of engines to 
 users who shouldn't care (or only care if they are specifying an engine type 
 that gives them access to these details). Allowing public access to these 
 APIs worries me in that they are now the API (which goes back to having an 
 engine type that exposes these, if that's desired, and if we are willing to 
 accept the consequences of exposing them).

Given all we have discussed by now, calling it "internals of engines" is not 
correct anymore. If for some cases we know that only workers will be calling 
this API method and we need to protect the workflow execution from accidental 
calls from 3rd parties, I believe there are a million ways to solve this. The 
simplest thing that comes to my mind is just passing a generated token to 
confirm the authority to perform this operation.

 Who responds to the timeout though? Isn't that process the watchdog then? 
 Likely the triggering of a timeout causes something to react (in both 
 cases).
 The workflow engine IS this watch-dog (and in Mistral, the engine is a single 
 manager for all flow executions; in the prototype we call it engine_manager 
 and I hate this name :)) The engine lives in-process or as a separate process. 
 And it is passive - it executes in the thread of whoever calls it. E.g., 
 in-process it is either called on a messaging event handler thread, or by a 
 web method thread.
 
 
 Right, which in mistral is a web-server right (aka, wherever mistral is 
 setup) since the tasks finish by calling a rest-endpoint (or something 
 sitting on MQ?)?

Not exactly right. Currently it’s a web server but we’re about to decouple 
engine and API server. Most of the work is done. Engine is supposed to listen 
to a queue and there may be any number of engines since they are stateless and 
hence what’s behind a web server can be scaled as needed. And actually a web 
server tier can be scaled easily too (assuming we have a loadbalancer in place).

 That may be good enough: when DSL is translated to flow, and the task 
 demands repetition with timeout, it's ok to do this trick under the hood 
 when compiling a flow. 
 flow.add(LinearFlow(subblahblah, 
 retry=XYZ).add(OneTask().add(Timeout(timeout))
 
 
 Yup, in a way most language compilers do all these types of tricks under the 
 hood (in much much more complicated manners); as long as we retain 'user 
 intention' (aka don't mess with how the code executes) we should be able to do 
 any tricks we want (in fact most compilers do many many tricks). To me the same 
 kind of tricks start to become possible after we get the basics right (can't 
 do optimizations, aka -O2, if u don't have the basics in the first place).

I'd be careful about the assumption that we can convert DSL to flow; right now 
it's impossible since we would need to add more control flow primitives to 
TaskFlow. But that's what Kirill described in the prototype description.
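
For illustration, the kind of under-the-hood compilation trick discussed above
might look roughly like this; taskflow's retry.Times is a real primitive, while
the Timeout task here is made up for the example:

from taskflow import retry, task
from taskflow.patterns import linear_flow

class Timeout(task.Task):
    """Hypothetical stand-in for a timeout primitive (not part of TaskFlow)."""
    def __init__(self, seconds, **kwargs):
        super(Timeout, self).__init__(**kwargs)
        self._seconds = seconds

    def execute(self):
        pass   # a real implementation would arm a timer / deadline

def compile_repeat_with_timeout(one_task, attempts, seconds):
    # Roughly the LinearFlow(sub, retry=XYZ).add(OneTask(), Timeout(...)) idea.
    sub = linear_flow.Flow("repeat-with-timeout", retry=retry.Times(attempts))
    sub.add(one_task, Timeout(seconds, name="timeout"))
    return sub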

Renat Akhmerov
@ Mirantis Inc.


Re: [openstack-dev] [Mistral] How Mistral handling long running delegate tasks

2014-03-31 Thread Dmitri Zimine
Inline...
On Mar 27, 2014, at 5:10 PM, Joshua Harlow harlo...@yahoo-inc.com wrote:

 Thanks for the description!
 
 The steps here seem very much like what a taskflow engine does (which is 
 good).
 
 To connect this to how I think it could work in taskflow.
 1. Someone creates tasks/flows describing the work to-be-done (converting a DSL 
 -> taskflow tasks/flows/retry[1] objects…)
 2. On execute(workflow) engine creates a new workflow execution, computes the 
 first batch of tasks, creates executor for those tasks (remote, local…) and 
 executes those tasks.
 3. Waits for response back from futures returned from executor.
 4. Receives futures responses (or receives new response DELAY, for example), or 
 exceptions…
 5. Continues sending out batches of tasks that can still be executing (aka 
 tasks that don't have a dependency on the output of delayed tasks).
 6. If any delayed tasks remain after repeating #2-5 as many times as it can, the 
 engine will shut itself down (see http://tinyurl.com/l3x3rrb).
Why would the engine treat long running tasks differently? The model Mistral 
tried out is: the engine sends the batch of tasks and goes to sleep; the 
'worker/executor' calls the engine back when the task(s) complete. Can it be 
applied?
 On delay task finishing some API/webhook/other (the mechanism imho shouldn't 
 be tied to webhooks, at least not in taskflow, but should be left up to the 
 user of taskflow to decide how to accomplish this) will be/must be 
 responsible for resuming the engine and setting the result for the previous 
 delayed task.
Oh no, webhook is the way to expose it to 3rd party system. From the library 
standpoint it's just an API call. 

One can do it even now by getting the appropriate Flow_details, instantiating 
an engine (flow, flow_details) and running it to continue from where it left 
off. Is that how you mean it? But I keep on dreaming of a passive version of the 
TaskFlow engine which treats all tasks the same and exposes one extra method - 
handle_tasks. 
 Repeat 2 - 7 until all tasks have executed/failed.
 Profit!

 This seems like it could be accomplished, although there are race conditions 
 in the #6 (what if multiple delayed requests are received at the same time)? 
 What locking is done to ensure that this doesn't cause conflicts?
The engine must handle concurrent calls of mutation methods - start, stop, 
handle_action. How it does so differs depending on whether the engine runs in 
multiple threads or in an event loop on queues of calls. 

 Does the POC solve that part (no simultaneous step #5 from below)?
Yes although we may want to revisit the current solution. 

 There was a mention of a watch-dog (ideally to ensure that delayed tasks 
 can't just sit around forever), was that implemented?
If _delayed_ tasks and 'normal' tasks are treated alike, this is just a matter of 
a timeout as a generic property on a task. So Mistral didn't have to have it. For 
the proposal above, a separate treatment is necessary for _delayed_ tasks. 
 
 [1] https://wiki.openstack.org/wiki/TaskFlow#Retries (new feature!)
This is nice. I would call it a 'repeater': running a sub flow several times 
with various data for various reasons is richer than 'retry'. 
What about the 'retry policy' on an individual task? 

 

Re: [openstack-dev] [Mistral] How Mistral handling long running delegate tasks

2014-03-31 Thread Joshua Harlow
More responses inline :)

From: Dmitri Zimine d...@stackstorm.com
Date: Monday, March 31, 2014 at 5:59 PM
To: Joshua Harlow harlo...@yahoo-inc.com
Cc: OpenStack Development Mailing List (not for usage questions) 
openstack-dev@lists.openstack.org
Subject: Re: [openstack-dev] [Mistral] How Mistral handling long running 
delegate tasks

Inline...
On Mar 27, 2014, at 5:10 PM, Joshua Harlow 
harlo...@yahoo-inc.com wrote:

Thanks for the description!

The steps here seem very much like what a taskflow engine does (which is good).

To connect this to how I think could work in taskflow.

  1.  Someone creates tasks/flows describing the work to-be-done (converting a 
DSL - taskflow tasks/flows/retry[1] objects…)
  2.  On execute(workflow) engine creates a new workflow execution, computes 
the first batch of tasks, creates executor for those tasks (remote, local…) and 
executes those tasks.
  3.  Waits for response back from 
futures (http://docs.python.org/dev/library/concurrent.futures.html) returned 
from executor.
  4.  Receives futures responses (or receives new response DELAY, for example), 
or exceptions…
  5.  Continues sending out batches of tasks that can still be executing 
(aka tasks that don't have a dependency on the output of delayed tasks).
  6.  If any delayed tasks after repeating #2-5 as many times as it can, the 
engine will shut itself down (see http://tinyurl.com/l3x3rrb).

Why would the engine treat long running tasks differently? The model Mistral 
tried out is: the engine sends the batch of tasks and goes to sleep; the 
'worker/executor' calls the engine back when the task(s) complete. Can it be 
applied?

Not all ways of using taskflow I think want to go 'asleep', some active 
conductors I would think actually stay 'awake', that’s why I was thinking this 
would be a compromise, the usage of 'DELAY' (via exception or return) would 
allow a task to say it is going to be finished sometime in the future (btw 
started this @ https://review.openstack.org/#/c/84277/ --- WIP, likely won't go 
into taskflow 0.2 but seems like a reasonable start on something like this 
proposal). To me offering just the going asleep way means that all users of 
taskflow need to have some type of entrypoint (rest, mq, other…) that restarts 
the engine on result finishing. I'd rather not force that model onto all users 
(since that seems inappropriate).
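
Purely as an illustration of the DELAY idea (at the time only a WIP review, not
a released TaskFlow feature), a task signalling that its result will arrive
later might look like this; the exception class and external client are made up:

from taskflow import task

class Delayed(Exception):
    """Signals: this task will be finished later, via an external callback."""

class CallExternalService(task.Task):
    def execute(self, request, client, callback_url):
        # Kick off the long running work in the 3rd party system and tell the
        # engine the result will be delivered later (e.g. through a webhook).
        client.submit(request, callback_url=callback_url)
        raise Delayed()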



  7.  On delay task finishing some API/webhook/other (the mechanism imho 
shouldn't be tied to webhooks, at least not in taskflow, but should be left up 
to the user of taskflow to decide how to accomplish this) will be/must be 
responsible for resuming the engine and setting the result for the previous 
delayed task.

Oh no, webhook is the way to expose it to 3rd party system. From the library 
standpoint it's just an API call.

One can do it even now by getting the appropriate Flow_details, instantiating 
an engine (flow, flow_details) and running it to continue from where it left 
off. Is that how you mean it? But I keep on dreaming of a passive version of the 
TaskFlow engine which treats all tasks the same and exposes one extra method - 
handle_tasks.

I was thinking that, although I'm unsure on the one extra method idea, can u 
explain more :) What would handle_tasks do? Restart/resume the engine 
(basically the work u stated 'getting the appropriate Flow_details, 
instantiating an engine (flow, flow_details) and running it to continue')? 
Seems like a tiny helper function if it's really wanted, but maybe I'm 
misunderstanding.  It might be connected into a recent taskflow BP @ 
https://blueprints.launchpad.net/taskflow/+spec/generic-flow-conductor?

  8.  Repeat 2 - 7 until all tasks have executed/failed.
  9.  Profit!

This seems like it could be accomplished, although there are race conditions in 
the #6 (what if multiple delayed requests are received at the same time)? What 
locking is done to ensure that this doesn't cause conflicts?
The engine must handle concurrent calls of mutation methods - start, stop, 
handle_action. How it does so differs depending on whether the engine runs in 
multiple threads or in an event loop on queues of calls.

Agreed, to me this requires some level of locking, likely something that the 
tooz (https://review.openstack.org/#/c/71167/) library can provide (or in 
concept is similar to what the jobboard 
(https://wiki.openstack.org/wiki/TaskFlow/Paradigm_shifts#Workflow_ownership_transfer) 
concept provides: single 'atomic' engine access to a given workflow, ensuring 
this with a distributed locking scheme, such as zookeeper).


Does the POC solve that part (no simultaneous step #5 from below)?
Yes although we may want to revisit the current solution.

Is it using some form of database locking? :(


There was a mention of a watch-dog (ideally to ensure that delayed tasks can't 
just sit around forever), was that implemented?
If _delayed_ tasks

[openstack-dev] [Mistral] How Mistral handling long running delegate tasks

2014-03-27 Thread Dmitri Zimine
Following up on http://tinyurl.com/l8gtmsw and http://tinyurl.com/n3v9lt8:  
this explains how Mistral handles long running delegate tasks. Note that a 
'passive' workflow engine can handle both normal tasks and delegates the same 
way. I'll also put that on ActionDesign wiki, after discussion.

Diagram: 
https://docs.google.com/a/stackstorm.com/drawings/d/147_EpdatpN_sOLQ0LS07SWhaC3N85c95TkKMAeQ_a4c/edit?usp=sharing

1. On start(workflow), engine creates a new workflow execution, computes the 
first batch of tasks, sends them to ActionRunner [1].
2. ActionRunner creates an action and calls action.run(input)
3. Action does the work (compute (10!)), produces the results, and returns the 
results to the executor. If it returns, status=SUCCESS. If it fails, it throws an 
exception, status=ERROR.
4. ActionRunner notifies Engine that the task is complete task_done(execution, 
task, status, results)[2]
5. Engine computes the next task(s) ready to trigger, according to control flow 
and data flow, and sends them to ActionRunner.
6. Like step 2: ActionRunner calls the action's run(input)
7. A delegate action doesn't produce results: it calls out the 3rd party 
system, which is expected to make a callback to a workflow service with the 
results. It returns to ActionRunner without results, immediately.  
8. ActionRunner marks status=RUNNING [?]
9. 3rd party system takes a 'long time' == longer than any system component can 
be assumed to stay alive. 
10. 3rd party component calls Mistral WebHook which resolves to 
engine.task_complete(workbook, id, status, results)  
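
Step 10 as seen from the 3rd party side could be as simple as the following; the
URL layout and payload are hypothetical, not the actual Mistral REST API:

import requests

def report_result(mistral_url, workbook, task_id, status, results):
    # 3rd party system calling the Mistral webhook, which resolves to
    # engine.task_complete(workbook, id, status, results) on the Mistral side.
    requests.post(
        "%s/workbooks/%s/tasks/%s" % (mistral_url, workbook, task_id),
        json={"state": status, "output": results})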

Comments: 
* One Engine handles multiple executions of multiple workflows. It exposes two 
main operations: start(workflow) and task_complete(execution, task, status, 
results), and is responsible for defining the next batch of tasks based on control 
flow and data flow. Engine is passive - it runs in a host's thread. Engine and 
ActionRunner communicate via task queues asynchronously; for details, see  
https://wiki.openstack.org/wiki/Mistral/POC 

* The Engine doesn't distinguish between sync and async actions; it doesn't deal 
with Actions at all. It only reacts to task completions, handling the results, 
updating the state, and queuing the next set of tasks.

* Only the Action can know and define whether it is a delegate or not. Some 
protocol is required to let ActionRunner know that the action is not returning 
the results immediately. A convention of returning None may be sufficient (see 
the sketch after these comments). 

* Mistral exposes  engine.task_done in the REST API so 3rd party systems can 
call a web hook.
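
A minimal sketch of the returning-None convention mentioned above; the Action
classes and the injected 3rd party client are illustrative, not actual Mistral
code:

class ComputeFactorialAction(object):
    """Normal action: results are available immediately."""
    def run(self, n):
        result = 1
        for i in range(2, n + 1):
            result *= i
        return result            # status=SUCCESS with these results

class CreateServerAction(object):
    """Delegate action: the 3rd party will call the webhook with results."""
    def __init__(self, client, callback_url):
        self._client = client            # 3rd party API client (assumed)
        self._callback_url = callback_url

    def run(self, spec):
        self._client.create_server(spec, callback=self._callback_url)
        return None   # no results yet: ActionRunner leaves the task RUNNING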

DZ.

[1]  I use ActionRunner instead of Executor (current name) to avoid confusion: 
it is the Engine which is responsible for executions, and ActionRunner only runs 
actions. We should rename it in the code.

[2] I use task_done for brevity and out of pure spite; in the code it is 
convey_task_results.


Re: [openstack-dev] [Mistral] How Mistral handling long running delegate tasks

2014-03-27 Thread Joshua Harlow
Thanks for the description!

The steps here seem very much like what a taskflow engine does (which is good).

To connect this to how I think it could work in taskflow.

  1.  Someone creates tasks/flows describing the work to-be-done (converting a 
DSL - taskflow tasks/flows/retry[1] objects…)
  2.  On execute(workflow) engine creates a new workflow execution, computes 
the first batch of tasks, creates executor for those tasks (remote, local…) and 
executes those tasks.
  3.  Waits for response back from 
futures (http://docs.python.org/dev/library/concurrent.futures.html) returned 
from executor.
  4.  Receives futures responses (or receives new response DELAY, for example), 
or exceptions…
  5.  Continues sending out batches of tasks that can still be executing 
(aka tasks that don't have a dependency on the output of delayed tasks).
  6.  If any delayed tasks after repeating #2-5 as many times as it can, the 
engine will shut itself down (see http://tinyurl.com/l3x3rrb).
  7.  On delay task finishing some API/webhook/other (the mechanism imho 
shouldn't be tied to webhooks, at least not in taskflow, but should be left up 
to the user of taskflow to decide how to accomplish this) will be/must be 
responsible for resuming the engine and setting the result for the previous 
delayed task.
  8.  Repeat 2 - 7 until all tasks have executed/failed.
  9.  Profit!

This seems like it could be accomplished, although there are race conditions in 
the #6 (what if multiple delayed requests are received at the same time)? What 
locking is done to ensure that this doesn't cause conflicts? Does the POC solve 
that part (no simultaneous step #5 from below)? There was a mention of a 
watch-dog (ideally to ensure that delayed tasks can't just sit around forever), 
was that implemented?

[1] https://wiki.openstack.org/wiki/TaskFlow#Retries (new feature!)
