Re: [openstack-dev] [Mistral] How Mistral handling long running delegate tasks
On 04 Apr 2014, at 07:33, Kirill Izotov enyk...@stackstorm.com wrote:

Then, we can make the task executor interface public and allow clients to provide their own task executors. It will then be possible for Mistral to implement its own task executor, or several, and share the executors between all the engine instances.

I'm afraid that if we start to tear apart the TaskFlow engine, it would quickly become a mess to support. Besides, the amount of things left to integrate after we throw out the engine might be so low that it proves the whole process of integration to be just nominal, and we are back to square one. Anyway, task execution is the part that bothers me least; both the graph action and the engine itself are where the pain will be.

Would love to see something additional (boxes and arrows) explaining this approach. Sorry, I'm hardly following the idea.

That is part of our public API, it is stable and good enough. Basically, I don't think this API needs any major change. But whatever should and will be done about it, I daresay all that work can be done without affecting the API more than I described above.

I completely agree that we should not change the public API of the sync engine, especially the one in the helpers. What we need is, on the contrary, a low-level construct that would do the number of things I stated previously, but would be part of the public API of TaskFlow so we can be sure it would work exactly the same way it worked yesterday.

I'm 99.9% sure we'll have to change the API, because all we've been discussing so far made me think this is a key point going implicitly through all our discussions: without having a public method like "task_done" we won't build a truly passive/async execution model. And it doesn't matter whether it uses futures, callbacks or whatever else inside. And again, just want to repeat: if we are able to deal with all the challenges that the passive/async execution model exposes, then other models can be built trivially on top of it.
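To make the "task_done" point concrete, here is a minimal sketch of what such a public passive-engine interface might look like. This is purely illustrative: the class and method names are assumptions for discussion, not actual TaskFlow or Mistral API.

```python
import abc


class PassiveEngine(abc.ABC):
    """Illustrative passive/async engine contract: the engine never
    blocks waiting on task results; an external caller reports them."""

    @abc.abstractmethod
    def run(self):
        """Schedule the first batch of ready tasks and return immediately."""

    @abc.abstractmethod
    def task_done(self, task_name, result):
        """Record a task result and schedule newly unblocked tasks."""


class RecordingEngine(PassiveEngine):
    """Trivial implementation that just records calls, to show the
    shape of the contract."""

    def __init__(self):
        self.calls = []

    def run(self):
        self.calls.append(("run",))

    def task_done(self, task_name, result):
        self.calls.append(("task_done", task_name, result))
```

Whether task_done is driven by futures, callbacks or an MQ consumer then becomes an implementation detail hidden behind the interface, which is the point being argued above.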
@Ivan, Thanks for joining the conversation. Looks like we really need your active participation, since you're the one who knows all the TF internals and concepts very well. As for what you wrote about futures and callbacks, it would be helpful to see some illustration of your idea.

Renat Akhmerov @ Mirantis Inc.

___
OpenStack-dev mailing list
OpenStack-dev@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
Re: [openstack-dev] [Mistral] How Mistral handling long running delegate tasks
On 03 Apr 2014, at 12:49, Kirill Izotov enyk...@stackstorm.com wrote:

Actually, the idea to make the concept more expandable is exactly my point =) Mistral's Workbook is roughly the same as a TaskFlow Flow and has nothing to do with a TaskFlow LogBook. The only major difference in the case of Workbook is that we have to store it after it was created using the API or DSL. I very much like the idea to inherit from the basic Flow and add serialization and additional metadata on top of it, but the concept of retrying within a Workbook/Flow is not exactly what we have in mind. Instead of repeating the Flow, we are repeating the Task (which has the same differences between Mistral and TaskFlow as Workbook has), and to use the trick you have proposed we would have to translate the Task into a Flow. How much sense does that make with respect to expandability and generality?

Just one clarification. Workbook is a wider abstraction. Along with the workflow itself it includes action definitions and triggers (even though it's still controversial even within the team itself). Potentially it may include something else. So having a workbook in TaskFlow with strictly defined behaviour is not a good decision.

Ya, I think you are hitting this issue: if a worker acks a message saying it's 'working on it' and then the worker dies, this is where a message queue ack model won't really help. The engine will believe that the worker is working on it (since, hey, the worker acked it), but the engine will really have no idea that the worker is actually dead. Of course timeouts can be used to get around this, but imho that's really inelegant when something like ZooKeeper (or similar) can be used to be notified of this worker's death immediately; which means no timeouts are needed at all, and this separate process or engine can be notified of the death (and resolve it immediately).
I don't believe this kind of 'liveness' connection is possible with Rabbit (we also must think of other things besides Rabbit, like Qpid and such), but I might be wrong; I thought once a worker acks a message, whoever receives that ack will believe the work has started and will be finished someday in the future (aka, no indication that the work is actually in progress). Anyways, we can discuss this more since I think it's a common confusion point :-)

As far as I understand, what Renat is proposing here is not to acknowledge the message until it has been successfully executed. I'm not sure how exactly RabbitMQ will react in that case, though you are right in the idea that we must investigate the other solutions more carefully.

Joshua, I think your understanding of how it works is not exactly correct (or my understanding, actually :)). Let's talk in IRC, we'll explain it. I'd suggest we talk in IRC on Thursday at 8.30 pm (Friday 10.30 in our timezone).

Renat Akhmerov @ Mirantis Inc.
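The ack semantics being debated here (the consumer does not acknowledge until the task has actually been processed, so the broker redelivers if the consumer dies first) can be modelled without a broker at all. The following is a toy in-memory sketch of that behaviour, not RabbitMQ client code:

```python
from collections import deque


class AckQueue:
    """Toy model of a broker queue with manual acknowledgement: a
    message pulled by a consumer is only removed once acked; if the
    consumer's connection dies before acking, the broker requeues
    the message for redelivery."""

    def __init__(self):
        self._ready = deque()
        self._unacked = {}   # delivery_tag -> message
        self._next_tag = 0

    def publish(self, msg):
        self._ready.append(msg)

    def get(self):
        # deliver a message but keep it pending until it is acked
        msg = self._ready.popleft()
        self._next_tag += 1
        self._unacked[self._next_tag] = msg
        return self._next_tag, msg

    def ack(self, tag):
        # only now is the message gone for good
        del self._unacked[tag]

    def requeue_unacked(self):
        # what the broker does when the consumer's connection dies
        for tag, msg in list(self._unacked.items()):
            self._ready.append(msg)
            del self._unacked[tag]
```

Note this only guarantees the message survives a worker crash; as Joshua points out, it says nothing about whether a worker that acked late is alive, which is the gap liveness tracking (e.g. ZooKeeper) would fill.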
Re: [openstack-dev] [Mistral] How Mistral handling long running delegate tasks
I'm trying to catch up with this rather long and interesting discussion, sorry for the somewhat late reply. I can see these aspects of 'lazy model' support in TaskFlow:
- how tasks are executed and reverted
- how flows are run
- how the engine works internally
Let me address those aspects separately.

== Executing and reverting tasks ==

I think that should be done via a different interface than running a flow (or scheduling it to run), as it is a completely different thing. In current TaskFlow this interface is called the task executor: https://github.com/openstack/taskflow/blob/master/taskflow/engines/action_engine/executor.py#L57 That is actually how our WorkerBasedEngine was implemented: it's the same engine with a special task executor that schedules tasks on a worker instead of running task code locally. Task executors are not aware of flows by design; all they do is execute and revert tasks. That means that task executors can easily be shared between engines if that's wanted.

The current TaskExecutorBase interface uses futures (PEP 3148-like). When I proposed it, futures looked like a good tool for the task at hand (see e.g. the async task etherpad https://etherpad.openstack.org/p/async-taskflow-tasks). Now it may be time to reconsider that: having one future object per running task may become a scalability issue. It may be worth using callbacks instead. It should not be too hard to refactor the current engine for that. Also, as TaskExecutorBase is an internal API, there should not be any compatibility issues. Then, we can make the task executor interface public and allow clients to provide their own task executors. It will then be possible for Mistral to implement its own task executor, or several, and share the executors between all the engine instances.
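Ivan's callbacks-instead-of-futures idea might look roughly like the sketch below. The class and method names here are hypothetical, not the actual TaskExecutorBase API:

```python
class CallbackTaskExecutor:
    """Sketch of a callback-style task executor: the engine registers a
    single completion callback instead of holding one future object per
    running task."""

    def __init__(self, on_done):
        self._on_done = on_done   # callable(task_name, result)

    def execute_task(self, task_name, fn, *args):
        # a real executor would hand the task to a thread pool or a
        # remote worker over MQ; running inline keeps the sketch
        # self-contained
        self._on_done(task_name, fn(*args))
```

The scalability argument is that the engine's bookkeeping then stays O(1) per completion event rather than holding a live future object for every outstanding task.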
You can call it a plan ;)

== Running flows ==

To run a flow the TaskFlow client uses the engine interface; also, there are a few helper functions provided for convenience: http://docs.openstack.org/developer/taskflow/engines.html#module-taskflow.engines.base http://docs.openstack.org/developer/taskflow/engines.html#creating-engines That is part of our public API; it is stable and good enough. Basically, I don't think this API needs any major change. Maybe it is worth adding a function or method to schedule a flow to run without actually waiting for flow completion (at least, it was on my top secret TODO list for quite a long time).

== Engine internals ==

Each engine eats resources, like the thread it runs on; using these resources to run only one flow is somewhat wasteful. Some work is already planned to address this situation (see e.g. https://blueprints.launchpad.net/taskflow/+spec/share-engine-thread). Also, it might be a good idea to implement a different 'type' of engine to support the 'lazy' model, as Joshua suggests. But whatever should and will be done about it, I daresay all that work can be done without affecting the API more than I described above.

--
WBR,
Ivan A. Melnikov
... tasks must flow ...

On 02.04.2014 01:51, Dmitri Zimine wrote: Even more responses inline :) [...]
Re: [openstack-dev] [Mistral] How Mistral handling long running delegate tasks
Thanks Ivan, This does seem like a possible way forward also. It would be interesting to see what/how callbacks would work vs. futures and what extension point we could provide to Mistral for task execution (maybe there the task executor would complete by doing a call to some service, not AMQP for example?). Maybe some example/POC code would help all :-)

-Josh

-Original Message-
From: Ivan Melnikov imelni...@griddynamics.com
Date: Thursday, April 3, 2014 at 12:04 PM
To: OpenStack Development Mailing List (not for usage questions) openstack-dev@lists.openstack.org, Joshua Harlow harlo...@yahoo-inc.com
Subject: Re: [openstack-dev] [Mistral] How Mistral handling long running delegate tasks

[...]
Re: [openstack-dev] [Mistral] How Mistral handling long running delegate tasks
Then, we can make the task executor interface public and allow clients to provide their own task executors. It will then be possible for Mistral to implement its own task executor, or several, and share the executors between all the engine instances.

I'm afraid that if we start to tear apart the TaskFlow engine, it would quickly become a mess to support. Besides, the amount of things left to integrate after we throw out the engine might be so low that it proves the whole process of integration to be just nominal, and we are back to square one. Anyway, task execution is the part that bothers me least; both the graph action and the engine itself are where the pain will be.

That is part of our public API, it is stable and good enough. Basically, I don't think this API needs any major change. But whatever should and will be done about it, I daresay all that work can be done without affecting the API more than I described above.

I completely agree that we should not change the public API of the sync engine, especially the one in the helpers. What we need is, on the contrary, a low-level construct that would do the number of things I stated previously, but would be part of the public API of TaskFlow so we can be sure it would work exactly the same way it worked yesterday.

--
Kirill Izotov

On Friday, 4 April 2014 at 2:04, Ivan Melnikov wrote:

[...]
Re: [openstack-dev] [Mistral] How Mistral handling long running delegate tasks
On 02 Apr 2014, at 13:38, Kirill Izotov enyk...@stackstorm.com wrote:

I agree that we probably need a new engine for that kind of changes and, as Renat already said in another thread, the lazy model seems to be more basic and it would be easier to build a sync engine on top of it rather than the other way around. Yes, it will entail a lot of changes in the engines that are currently here, but it seems like the only way to get something that would fit us both. Since it seems like we are getting some kind of agreement here, we should probably start shifting to the plane where we discuss the design of the new engine, rather than the need for one.

The idea has been spread over too many places, so I'll try to gather it back. The lazy engine should be async and atomic, and it should not have its own state; instead it should rely on some kind of global state (DB or in-memory, depending on the type of application). It should have at least two methods: run and task_complete. The run method should calculate the first batch of tasks and schedule them for execution (either put them in a queue or spawn the threads... or send a pigeon, I insist =). Task_complete should mark a certain task as completed and then schedule the next batch of tasks that became available due to the resolution of this one.

Then, on top of it you can build a sync engine by introducing Futures. You are using async.run() to schedule the tasks by transforming them into Futures and then starting a loop, checking Futures for completion and sending their results to async.task_complete(), which would produce even more Futures to check over. Just the same way TaskFlow does it right now. On the Mistral side, we are using the lazy engine by patching async.run to the API (or engine queue) and async.task_complete to the worker queue result channel (and the API for long running tasks). We are still sharing the same graph_analyzer, but instead of relying on a loop and Futures, we are handling execution in a scalable and robust way.
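The run/task_complete contract described above can be sketched as a toy in-memory implementation. The dependency bookkeeping below stands in for the graph analyzer, and the sets would live in a DB in a real deployment so that any engine instance can process any completion; everything here is illustrative, not TaskFlow or Mistral code:

```python
class LazyEngine:
    """Toy passive engine with no loop of its own: run() schedules the
    initial ready tasks; task_complete() records a completion and
    schedules whatever that completion unblocked."""

    def __init__(self, deps, schedule):
        # deps: task name -> collection of task names it depends on
        self._deps = {t: set(d) for t, d in deps.items()}
        self._done = set()
        self._scheduled = set()
        self._schedule = schedule   # callable(task), e.g. enqueue for a worker

    def _dispatch_ready(self):
        for task, deps in self._deps.items():
            if task not in self._scheduled and deps <= self._done:
                self._scheduled.add(task)
                self._schedule(task)

    def run(self):
        self._dispatch_ready()

    def task_complete(self, task, result=None):
        self._done.add(task)
        self._dispatch_ready()
```

A diamond-shaped flow (b and c depend on a, d depends on both) then runs purely off external task_complete calls, which is exactly what lets Mistral wire those calls to an MQ result channel instead of a loop.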
The reason I'm proposing to extract Futures from the async engine is that they won't work if we have multiple engines that should handle the task results concurrently (and without that there will be no scalability).

I agree with all the major points here. It sounds like a plan.

What I see bothering Joshua is that a worker handling a long running task may just die in the process, and there is no way for us to know for sure whether it is still working or not. It is more important for the sync engine, because without that it would just hang forever (though it is not clear whether the sync engine needs long running tasks at all?). In the case of the sync engine, the role of watchdog may be performed by the loop, while in the case of Mistral it might be a separate process (though I bet there should be something for resending the message in RabbitMQ itself). A common solution is not necessarily needed here.

As for not losing messages (resending them to the queue), that's why I mentioned message acknowledgement ([0]). Rabbit does it automatically: if a message pulled out of a queue must be acknowledged and the corresponding Rabbit client connection dies before the client acknowledges the message (at which point there's a guarantee it's been processed), the message is resent. It works this way now in Mistral.

A slight problem with the current Mistral implementation is that the engine works as follows when starting a workflow:

1. Start a DB transaction, make all the smart logic and DB operations (updating the persistent state of tasks and execution), commit the DB transaction.
2. Submit task messages to MQ.

So there's a window between steps 1 and 2: if the system crashes there, the DB state won't be consistent with the state of the MQ. From a worker perspective it looks like this problem doesn't exist (at least we thought it through and didn't find any synchronisation windows).
Initially step 2 was part of step 1 (part of the DB transaction), and that would solve the problem, but it's generally a bad pattern to have interprocess communication inside a DB transaction, so we got rid of it. This seems to me to be the only problem with the current architecture; we left it as it was and were going to get back to it after the POC. I assume this might be the reason why Josh keeps talking about the 'jobboard' thing, but I just don't have enough details at this point to say whether that's what we need or not.

One solution I see here is just to consider this situation exceptional from the workflow execution model perspective and apply policies to tasks that are persisted in the DB but not submitted to MQ (what you guys discussed about retries and all that stuff...). So our watchdog process could just resubmit tasks that have been hanging in the DB longer than a configured period in the state IDLE. Something like that.

[0] http://www.rabbitmq.com/tutorials/tutorial-two-python.html

Renat Akhmerov @ Mirantis Inc.
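The watchdog policy sketched above (resubmit tasks that have been persisted as IDLE for longer than a configured period, on the assumption they fell into the DB-commit/MQ-submit window) might look like this. The task fields, state name and threshold are illustrative assumptions, not the actual Mistral schema:

```python
import time

IDLE_RESUBMIT_AFTER = 60.0   # seconds; hypothetical policy knob


def resubmit_stuck_tasks(db_tasks, submit_to_mq, now=None):
    """One watchdog pass: any task still IDLE past the threshold is
    assumed never to have reached the MQ and is resubmitted."""
    now = time.time() if now is None else now
    resubmitted = []
    for task in db_tasks:
        if task["state"] == "IDLE" and now - task["updated_at"] > IDLE_RESUBMIT_AFTER:
            submit_to_mq(task)
            resubmitted.append(task["name"])
    return resubmitted
```

Because resubmission is idempotent at the DB level (the task is still IDLE until a worker picks it up), running this pass from a separate watchdog process is safe even if the original submit did in fact go through.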
Re: [openstack-dev] [Mistral] How Mistral handling long running delegate tasks
Even more responses inline :)

On Mar 31, 2014, at 8:44 PM, Joshua Harlow harlo...@yahoo-inc.com wrote: More responses inline :)

From: Dmitri Zimine d...@stackstorm.com
Date: Monday, March 31, 2014 at 5:59 PM
To: Joshua Harlow harlo...@yahoo-inc.com
Cc: OpenStack Development Mailing List (not for usage questions) openstack-dev@lists.openstack.org
Subject: Re: [openstack-dev] [Mistral] How Mistral handling long running delegate tasks

Inline... On Mar 27, 2014, at 5:10 PM, Joshua Harlow harlo...@yahoo-inc.com wrote:

Thanks for the description! The steps here seem very much like what a taskflow engine does (which is good). To connect this to how I think it could work in taskflow:

1. Someone creates tasks/flows describing the work to be done (converting a DSL to taskflow tasks/flows/retry[1] objects…).
2. On execute(workflow) the engine creates a new workflow execution, computes the first batch of tasks, creates an executor for those tasks (remote, local…) and executes those tasks.
3. Waits for responses back from the futures returned by the executor.
4. Receives futures responses (or receives a new response, DELAY for example), or exceptions…
5. Continues sending out batches of tasks that can still be executed (aka tasks that don't have a dependency on the output of delayed tasks).
6. If any tasks are still delayed after repeating #2-5 as many times as it can, the engine will shut itself down (see http://tinyurl.com/l3x3rrb).

Why would the engine treat long running tasks differently? The model Mistral tried out is: the engine sends the batch of tasks and goes to sleep; the worker/executor calls the engine back when the task(s) complete.
Can it be applied?

Not all ways of using taskflow want to go 'asleep', I think; some active conductors would actually stay 'awake'. That's why I was thinking this would be a compromise: the usage of 'DELAY' (via exception or return) would allow a task to say it is going to be finished sometime in the future (btw, started this @ https://review.openstack.org/#/c/84277/ --- WIP, likely won't go into taskflow 0.2 but seems like a reasonable start on something like this proposal). To me, offering just the going-asleep way means that all users of taskflow need to have some type of entrypoint (REST, MQ, other…) that restarts the engine when a result finishes. I'd rather not force that model onto all users (since that seems inappropriate).

I understand and agree with not enforcing this model on all users. On the flip side, TaskFlow enforces the current model on Mistral :) It calls for supporting more models (to your earlier point of a 3rd mode of operations). The BP proposal is a step in that direction but still a compromise. How can we support both 'active' and 'passive' models, sharing the same 'library' of all the things responsible for control flow and data flow?
Here is a strawman (DISCLAIMER: this is sharing ideas and brainstorming; I don't know enough of TaskFlow to suggest):

1) Break the run loop: add an engine.run_async() method which, instead of looping over the tasks till the flow completes (http://tinyurl.com/nou2em9), only schedules the first batch and returns; https://github.com/openstack/taskflow/blob/master/taskflow/engines/action_engine/graph_action.py#L61-L68

2) Introduce an engine.task_complete(…) method that marks the current task complete, computes the next tasks ready for execution, schedules them and returns; loosely, doing this part: https://github.com/openstack/taskflow/blob/master/taskflow/engines/action_engine/graph_action.py#L75-L85

3) Make the executor signal the engine back by calling engine.task_complete(…): executor.execute_task() submits the task to the worker queue and calls engine.task_complete(status, results, etc.) instead of waiting for task completion and returning the results.

And it looks like it all can be done as a third, lazy engine, without messing with the current two engines?

On delayed task finishing, some API/webhook/other (the mechanism imho shouldn't be tied to webhooks, at least not in taskflow, but should be left up to the user of taskflow to decide how to accomplish this) will be/must be responsible for resuming the engine and setting the result for the previous delayed task. Oh no, the webhook is the way to expose it to a 3rd party system. From the library standpoint it's just an API call. One can do it even now by getting the appropriate Flow_details, instantiating an engine(flow, flow_details) and running it to continue from where it left off. Is that how you mean it? But I keep on dreaming of a passive version of the TaskFlow engine which treats all tasks the same and exposes one extra method: handle_tasks.

I was thinking that, although I'm unsure on the one-extra-method idea; can you explain more :) What would handle_tasks do?
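Strawman step 3, where execute_task() only hands the task off and the result path calls engine.task_complete() instead of resolving a future the engine is blocked on, could be sketched like this. All names here are hypothetical, not the actual graph_action or executor code:

```python
class FireAndForgetExecutor:
    """Sketch: execute_task() only enqueues work for a remote worker
    and returns immediately; when the worker's result message arrives,
    the transport layer calls back into the engine rather than
    completing a future."""

    def __init__(self, worker_queue, engine):
        self._queue = worker_queue   # stands in for an MQ producer
        self._engine = engine        # anything with task_complete()

    def execute_task(self, task):
        self._queue.append(task)     # hand off, do not wait

    def on_result_message(self, task, result):
        # invoked by the MQ consumer when a worker reports back
        self._engine.task_complete(task, result)
```

The key property is that nothing in the executor blocks between hand-off and result, so the engine process can serve other flows (or even exit) in between.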
Restart/resume the engine (basically the work you stated: 'getting the appropriate Flow_details, instantiating an engine (flow, flow_details) and running
Re: [openstack-dev] [Mistral] How Mistral handling long running delegate tasks
Cool, my thoughts added ;)

From: Dmitri Zimine d...@stackstorm.com
Date: Tuesday, April 1, 2014 at 2:51 PM
To: Joshua Harlow harlo...@yahoo-inc.com
Cc: OpenStack Development Mailing List (not for usage questions) openstack-dev@lists.openstack.org
Subject: Re: [openstack-dev] [Mistral] How Mistral handling long running delegate tasks

[...]
If any tasks are still delayed after repeating #2-5 as many times as it can, the engine will shut itself down (see http://tinyurl.com/l3x3rrb).

Why would the engine treat long running tasks differently? The model Mistral tried out is: the engine sends the batch of tasks and goes to sleep; the worker/executor calls the engine back when the task(s) complete. Can it be applied?

Not all ways of using taskflow want to go 'asleep', I think; some active conductors would actually stay 'awake'. That's why I was thinking this would be a compromise: the usage of 'DELAY' (via exception or return) would allow a task to say it is going to be finished sometime in the future (btw, started this @ https://review.openstack.org/#/c/84277/ --- WIP, likely won't go into taskflow 0.2 but seems like a reasonable start on something like this proposal). To me, offering just the going-asleep way means that all users of taskflow need to have some type of entrypoint (REST, MQ, other…) that restarts the engine when a result finishes. I'd rather not force that model onto all users (since that seems inappropriate).

I understand and agree with not enforcing this model on all users. On the flip side, TaskFlow enforces the current model on Mistral :) It calls for supporting more models (to your earlier point of a 3rd mode of operations). The BP proposal is a step in that direction but still a compromise. How can we support both 'active' and 'passive' models, sharing the same 'library' of all the things responsible for control flow and data flow?

To me this is what engine 'types' are for: to support different execution models while using as many of the same components as possible (making usage as seamless as can be).
Here is a strawman (DISCLAIMER: this is sharing ideas and brainstorming; I don't know enough of TaskFlow to suggest):

1) Break the run loop: add an engine.run_async() method which, instead of looping over the tasks till the flow completes (http://tinyurl.com/nou2em9), only schedules the first batch and returns; https://github.com/openstack/taskflow/blob/master/taskflow/engines/action_engine/graph_action.py#L61-L68

Seems like a possibility, although I'd like to make this still a new engine type that does this (or maybe the existing engine types wrap this change to retain the usage that exists). This just seems like an extension of the DELAY thought, where at each iteration the engine would return DELAYed, and then the caller (the one executing) would become aware that this has happened and decide how to handle it accordingly. Exposing 'run_async' still doesn't feel right; maybe it's just wording, but it seems like an 'async/lazy' engine just still has a run() method that returns DELAYed more often than an engine that loops and never returns DELAYed (or only returns DELAYed if a task causes this to happen). To me they are the same run() method, just different implementations of run(). Exposing the semantics of running by naming it run_async exposes the underlying execution abstraction, which imho isn't
Re: [openstack-dev] [Mistral] How Mistral handling long running delegate tasks
On 02 Apr 2014, at 05:45, Joshua Harlow harlo...@yahoo-inc.com wrote: Possibly, although to me this is still exposing the internals of engines to users who shouldn't care (or only care if they are specifying an engine type that gives them access to these details). Allowing public access to these APIs worries me in that they are now the API (which goes back to having an engine type that exposes these, if that's desired, and if we are willing to accept the consequences of exposing them).

Given all we have discussed by now, calling it "internals of engines" is not correct anymore. If for some cases we know that only workers will be calling this API method and we need to protect the workflow execution from occasional calls from 3rd parties, I believe there's a million ways to solve this. The simplest thing that comes to my mind is just passing a generated token to confirm authority to perform this operation.

Who responds to the timeout though? Isn't that process the watchdog then? Likely the triggering of a timeout causes something to react (in both cases).

The workflow engine IS this watchdog (and in Mistral, the engine is a single manager for all flow executions; in the prototype we call it engine_manager and I hate this name :)). The engine lives in-process or as a separate process. And it is passive - executed in a thread of a callee. E.g., in-process it is either called on a messaging event handler thread, or by a web method thread.

Right, which in mistral is a web-server right (aka, wherever mistral is setup), since the tasks finish by calling a rest-endpoint (or something sitting on MQ?)?

Not exactly. Currently it's a web server, but we're about to decouple the engine and the API server. Most of the work is done. The engine is supposed to listen to a queue, and there may be any number of engines since they are stateless; hence what's behind a web server can be scaled as needed. And actually a web server tier can be scaled easily too (assuming we have a loadbalancer in place).
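The "generated token" protection Renat mentions could look roughly like the sketch below: the engine issues a per-task token when it dispatches work, and the task-completion endpoint only accepts a result carrying the matching token. This is purely illustrative - the TokenGuard name and the HMAC construction are my own assumptions, not anything in Mistral:

```python
import hashlib
import hmac

# Hypothetical sketch: derive a per-task token from an engine-side
# secret, so only the worker that was handed the token can report
# that task's completion.


class TokenGuard:
    def __init__(self, secret):
        self._secret = secret  # bytes, known only to the engine

    def issue(self, task_id):
        # token handed to the worker together with the task
        return hmac.new(self._secret, task_id.encode(),
                        hashlib.sha256).hexdigest()

    def check(self, task_id, token):
        # constant-time comparison to avoid timing side channels
        return hmac.compare_digest(self.issue(task_id), token)
```

A task_done-style endpoint would call `guard.check(task_id, token)` before accepting results, rejecting stray or malicious 3rd-party calls without needing full authentication machinery.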
That may be good enough: when the DSL is translated to a flow and a task demands repetition with a timeout, it's ok to do this trick under the hood when compiling the flow: flow.add(LinearFlow(subblahblah, retry=XYZ).add(OneTask().add(Timeout(timeout)))

Yup, in a way most language compilers do all these types of tricks under the hood (in much much more complicated manners); as long as we retain 'user intention' (aka don't mess with how the code executes) we should be able to do any tricks we want (in fact most compilers do many many tricks). To me the same kind of tricks start to become possible after we get the basics right (can't do optimizations, aka -O2, if u don't have the basics in the first place).

I'd be careful about the assumption that we can convert DSL to flow; right now it's impossible since we need to add more control flow primitives in TaskFlow. But that's what Kirill described in the prototype description.

Renat Akhmerov @ Mirantis Inc.

___
OpenStack-dev mailing list
OpenStack-dev@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
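The "compile-time trick" being discussed - a DSL task asking for repetition-with-timeout getting silently wrapped in a retry-guarded sub-flow - might be sketched like so. The Flow/Task classes here are toy stand-ins for illustration, not the real taskflow patterns API:

```python
# Toy illustration of the DSL -> flow "compile-time trick": a task that
# asks for repetition is wrapped, under the hood, in a one-task sub-flow
# guarded by a retry controller, mirroring the
# flow.add(LinearFlow(..., retry=XYZ).add(OneTask())) shape above.


class Flow:
    def __init__(self, name, retry=None):
        self.name = name
        self.retry = retry      # retry controller/policy for children
        self.children = []

    def add(self, *nodes):
        self.children.extend(nodes)
        return self             # chainable, like the snippet above


class Task:
    def __init__(self, name, timeout=None):
        self.name = name
        self.timeout = timeout  # per-task timeout, if any


def compile_task(dsl_task):
    """Translate one DSL task definition into a flow node."""
    task = Task(dsl_task['name'], timeout=dsl_task.get('timeout'))
    if 'retry' in dsl_task:
        # repetition is expressed structurally, invisible to the user
        return Flow(dsl_task['name'] + '_wrapper',
                    retry=dsl_task['retry']).add(task)
    return task
```

The key property, echoing the compiler analogy above, is that the user's intention (this task repeats, with this timeout) is preserved while the structural rewrite stays entirely under the hood.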
Re: [openstack-dev] [Mistral] How Mistral handling long running delegate tasks
Inline... On Mar 27, 2014, at 5:10 PM, Joshua Harlow harlo...@yahoo-inc.com wrote:

Thanks for the description! The steps here seem very much like what a taskflow engine does (which is good). To connect this to how I think this could work in taskflow:

1. Someone creates tasks/flows describing the work to-be-done (converting a DSL - taskflow tasks/flows/retry[1] objects…)
2. On execute(workflow) the engine creates a new workflow execution, computes the first batch of tasks, creates an executor for those tasks (remote, local…) and executes those tasks.
3. Waits for responses back from the futures returned from the executor.
4. Receives futures responses (or receives a new response, DELAY, for example), or exceptions…
5. Continues sending out batches of tasks that can still be executed (aka tasks that don't have a dependency on the output of delayed tasks).
6. If any delayed tasks remain after repeating #2-5 as many times as it can, the engine will shut itself down (see http://tinyurl.com/l3x3rrb).

Why would the engine treat long running tasks differently? The model Mistral tried out is: the engine sends the batch of tasks and goes asleep; the 'worker/executor' calls the engine back when the task(s) complete. Can it be applied?

7. On a delayed task finishing, some API/webhook/other (the mechanism imho shouldn't be tied to webhooks, at least not in taskflow, but should be left up to the user of taskflow to decide how to accomplish this) will be/must be responsible for resuming the engine and setting the result for the previously delayed task.

Oh no, the webhook is the way to expose it to a 3rd party system. From the library standpoint it's just an API call. One can do it even now by getting the appropriate Flow_details, instantiating an engine (flow, flow_details) and running it to continue from where it left off. Is it how you mean it? But I keep on dreaming of a passive version of the TaskFlow engine which treats all tasks the same and exposes one extra method - handle_tasks.

8. Repeat 2 - 7 until all tasks have executed/failed.
9. Profit!
This seems like it could be accomplished, although there are race conditions in #6 (what if multiple delayed requests are received at the same time)? What locking is done to ensure that this doesn't cause conflicts?

The engine must handle concurrent calls of its mutation methods - start, stop, handle_action. How differs depending on whether the engine runs in multiple threads or in an event loop on queues of calls.

Does the POC solve that part (no simultaneous step #5 from below)?

Yes, although we may want to revisit the current solution.

There was a mention of a watch-dog (ideally to ensure that delayed tasks can't just sit around forever), was that implemented?

If _delayed_ tasks and 'normal' tasks are treated alike, this is just a matter of a timeout as a generic property on a task. So Mistral didn't have to have it. For the proposal above, a separate treatment is necessary for _delayed_ tasks.

[1] https://wiki.openstack.org/wiki/TaskFlow#Retries (new feature!)

This is nice. I would call it a 'repeater': running a sub flow several times with various data for various reasons is richer than 'retry'. What about a 'retry policy' on an individual task?

From: Dmitri Zimine d...@stackstorm.com Reply-To: OpenStack Development Mailing List (not for usage questions) openstack-dev@lists.openstack.org Date: Thursday, March 27, 2014 at 4:43 PM To: OpenStack Development Mailing List (not for usage questions) openstack-dev@lists.openstack.org Subject: [openstack-dev] [Mistral] How Mistral handling long running delegate tasks

Following up on http://tinyurl.com/l8gtmsw and http://tinyurl.com/n3v9lt8: this explains how Mistral handles long running delegate tasks. Note that a 'passive' workflow engine can handle both normal tasks and delegates the same way. I'll also put that on the ActionDesign wiki, after discussion. Diagram: https://docs.google.com/a/stackstorm.com/drawings/d/147_EpdatpN_sOLQ0LS07SWhaC3N85c95TkKMAeQ_a4c/edit?usp=sharing 1.
On start(workflow), the engine creates a new workflow execution, computes the first batch of tasks, and sends them to the ActionRunner [1].
2. The ActionRunner creates an action and calls action.run(input)
3. The Action does the work (compute (10!)), produces the results, and returns the results to the executor. If it returns, status=SUCCESS. If it fails, it throws an exception, status=ERROR.
4. The ActionRunner notifies the Engine that the task is complete: task_done(execution, task, status, results)[2]
5. The Engine computes the next task(s) ready to trigger, according to control flow and data flow, and sends them to the ActionRunner.
6. Like step 2: the ActionRunner calls the action's run(input)
7. A delegate action doesn't produce results: it calls out to the 3rd party system, which is expected to make a callback to a workflow service with the results. It returns to the ActionRunner without results, immediately.
8. The ActionRunner marks status=RUNNING [?]
9. The 3rd party system
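Step 4's task_done notification can arrive from several ActionRunners at once, which is exactly the concurrency Dmitri addresses above ("engine must handle concurrent calls of mutation methods"). For the multi-threaded case, that serialization could be sketched like this (hypothetical names, not the POC code):

```python
import threading

# Sketch of serializing the engine's mutation methods: every entry
# point that mutates execution state (start/stop/handle_action in the
# discussion above) funnels through one lock, so two simultaneous
# task-completion callbacks cannot interleave their state updates.


class PassiveEngine:
    def __init__(self):
        self._mutex = threading.Lock()
        self.completed = []  # stand-in for persisted execution state

    def handle_action(self, task_id, status, result):
        with self._mutex:
            # recording the result and computing the next batch happen
            # atomically with respect to other callbacks
            self.completed.append((task_id, status, result))
            return self._next_batch()

    def _next_batch(self):
        # placeholder: a real engine consults control flow and data
        # flow here to decide which tasks are ready to trigger
        return []
```

The event-loop variant mentioned above would achieve the same thing without a lock, by queueing all calls onto a single consumer.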
Re: [openstack-dev] [Mistral] How Mistral handling long running delegate tasks
More responses inline :)

From: Dmitri Zimine d...@stackstorm.com Date: Monday, March 31, 2014 at 5:59 PM To: Joshua Harlow harlo...@yahoo-inc.com Cc: OpenStack Development Mailing List (not for usage questions) openstack-dev@lists.openstack.org Subject: Re: [openstack-dev] [Mistral] How Mistral handling long running delegate tasks

Inline... On Mar 27, 2014, at 5:10 PM, Joshua Harlow harlo...@yahoo-inc.com wrote:

Thanks for the description! The steps here seem very much like what a taskflow engine does (which is good). To connect this to how I think this could work in taskflow:

1. Someone creates tasks/flows describing the work to-be-done (converting a DSL - taskflow tasks/flows/retry[1] objects…)
2. On execute(workflow) the engine creates a new workflow execution, computes the first batch of tasks, creates an executor for those tasks (remote, local…) and executes those tasks.
3. Waits for responses back from the futures (http://docs.python.org/dev/library/concurrent.futures.html) returned from the executor.
4. Receives futures responses (or receives a new response, DELAY, for example), or exceptions…
5. Continues sending out batches of tasks that can still be executed (aka tasks that don't have a dependency on the output of delayed tasks).
6. If any delayed tasks remain after repeating #2-5 as many times as it can, the engine will shut itself down (see http://tinyurl.com/l3x3rrb).

Why would the engine treat long running tasks differently? The model Mistral tried out is: the engine sends the batch of tasks and goes asleep; the 'worker/executor' calls the engine back when the task(s) complete.
Can it be applied?

Not all ways of using taskflow I think want to go 'asleep'; some active conductors I would think actually stay 'awake'. That's why I was thinking this would be a compromise: the usage of 'DELAY' (via exception or return) would allow a task to say it is going to be finished sometime in the future (btw started this @ https://review.openstack.org/#/c/84277/ --- WIP, likely won't go into taskflow 0.2 but seems like a reasonable start on something like this proposal). To me offering just the going-asleep way means that all users of taskflow need to have some type of entrypoint (rest, mq, other…) that restarts the engine on result finishing. I'd rather not force that model onto all users (since that seems inappropriate).

7. On a delayed task finishing, some API/webhook/other (the mechanism imho shouldn't be tied to webhooks, at least not in taskflow, but should be left up to the user of taskflow to decide how to accomplish this) will be/must be responsible for resuming the engine and setting the result for the previously delayed task.

Oh no, the webhook is the way to expose it to a 3rd party system. From the library standpoint it's just an API call. One can do it even now by getting the appropriate Flow_details, instantiating an engine (flow, flow_details) and running it to continue from where it left off. Is it how you mean it? But I keep on dreaming of a passive version of the TaskFlow engine which treats all tasks the same and exposes one extra method - handle_tasks.

I was thinking that, although I'm unsure on the one extra method idea, can u explain more :) What would handle_tasks do? Restart/resume the engine (basically the work u stated 'getting the appropriate Flow_details, instantiating an engine (flow, flow_details) and running it to continue')? Seems like a tiny helper function if it's really wanted, but maybe I'm misunderstanding. It might be connected to a recent taskflow BP @ https://blueprints.launchpad.net/taskflow/+spec/generic-flow-conductor?

8.
Repeat 2 - 7 until all tasks have executed/failed.
9. Profit!

This seems like it could be accomplished, although there are race conditions in #6 (what if multiple delayed requests are received at the same time)? What locking is done to ensure that this doesn't cause conflicts?

The engine must handle concurrent calls of its mutation methods - start, stop, handle_action. How differs depending on whether the engine runs in multiple threads or in an event loop on queues of calls.

Agreed; to me this requires some level of locking, likely something that the tooz library (https://review.openstack.org/#/c/71167/) can provide (or, in concept, similar to what the jobboard concept (https://wiki.openstack.org/wiki/TaskFlow/Paradigm_shifts#Workflow_ownership_transfer) provides: single 'atomic' engine access to a given workflow, ensuring this with a distributed locking scheme, such as zookeeper).

Does the POC solve that part (no simultaneous step #5 from below)?

Yes, although we may want to revisit the current solution.

Is it using some form of database locking? :(

There was a mention of a watch-dog (ideally to ensure that delayed tasks can't just sit around forever), was that implemented?

If _delayed_ tasks
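The jobboard-style idea Joshua refers to - single 'atomic' engine access to a given workflow - might be sketched like this. An in-memory dict stands in for zookeeper/tooz, and ClaimRegistry is an invented name; this only illustrates the ownership-transfer semantics, not a real distributed implementation:

```python
# Sketch of single-owner workflow access: before resuming a workflow
# to process a delayed result, a process must claim it; a concurrent
# claim by another process fails and that caller backs off/retries.
# In a real deployment the registry would live in zookeeper (or be
# provided by tooz), not in local memory.


class ClaimRegistry:
    def __init__(self):
        self._owners = {}  # workflow_id -> current owner

    def claim(self, workflow_id, owner):
        """Try to take exclusive ownership; idempotent for the same
        owner, False if someone else already holds the workflow."""
        current = self._owners.get(workflow_id)
        if current not in (None, owner):
            return False
        self._owners[workflow_id] = owner
        return True

    def release(self, workflow_id, owner):
        """Give the workflow up so another engine may claim it."""
        if self._owners.get(workflow_id) == owner:
            del self._owners[workflow_id]
```

With this in place, two delayed-result callbacks for the same workflow arriving at two stateless engine processes cannot both resume it: one claims and runs, the other fails the claim and requeues or retries.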
[openstack-dev] [Mistral] How Mistral handling long running delegate tasks
Following up on http://tinyurl.com/l8gtmsw and http://tinyurl.com/n3v9lt8: this explains how Mistral handles long running delegate tasks. Note that a 'passive' workflow engine can handle both normal tasks and delegates the same way. I'll also put that on the ActionDesign wiki, after discussion. Diagram: https://docs.google.com/a/stackstorm.com/drawings/d/147_EpdatpN_sOLQ0LS07SWhaC3N85c95TkKMAeQ_a4c/edit?usp=sharing

1. On start(workflow), the engine creates a new workflow execution, computes the first batch of tasks, and sends them to the ActionRunner [1].
2. The ActionRunner creates an action and calls action.run(input)
3. The Action does the work (compute (10!)), produces the results, and returns the results to the executor. If it returns, status=SUCCESS. If it fails, it throws an exception, status=ERROR.
4. The ActionRunner notifies the Engine that the task is complete: task_done(execution, task, status, results)[2]
5. The Engine computes the next task(s) ready to trigger, according to control flow and data flow, and sends them to the ActionRunner.
6. Like step 2: the ActionRunner calls the action's run(input)
7. A delegate action doesn't produce results: it calls out to the 3rd party system, which is expected to make a callback to a workflow service with the results. It returns to the ActionRunner without results, immediately.
8. The ActionRunner marks status=RUNNING [?]
9. The 3rd party system takes a 'long time' == longer than any system component can be assumed to stay alive.
10. The 3rd party component calls a Mistral WebHook which resolves to engine.task_complete(workbook, id, status, results)

Comments:
* One Engine handles multiple executions of multiple workflows. It exposes two main operations: start(workflow) and task_complete(execution, task, status, results), and is responsible for defining the next batch of tasks based on control flow and data flow. The Engine is passive - it runs in a host's thread.
The Engine and ActionRunner communicate via task queues asynchronously; for details, see https://wiki.openstack.org/wiki/Mistral/POC
* The Engine doesn't distinguish sync and async actions; it doesn't deal with Actions at all. It only reacts to task completions, handling the results, updating the state, and queuing the next set of tasks.
* Only an Action can know and define whether it is a delegate or not. Some protocol is required to let the ActionRunner know that the action is not returning the results immediately. A convention of returning None may be sufficient.
* Mistral exposes engine.task_done in the REST API so 3rd party systems can call a web hook.

DZ.

[1] I use ActionRunner instead of Executor (the current name) to avoid confusion: it is the Engine which is responsible for executions, and the ActionRunner only runs actions. We should rename it in the code.
[2] I use task_done for brevity and out of pure spite; in the code it is conveny_task_results.
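The returning-None convention for delegate actions proposed above could be sketched as follows. These are illustrative names mirroring the steps, not actual Mistral code:

```python
# Sketch of the protocol proposed above: an action returning None
# signals a delegate (results will arrive later via the webhook /
# task_done call); any other return value is a synchronous completion.

SUCCESS, ERROR, RUNNING = 'SUCCESS', 'ERROR', 'RUNNING'


class ActionRunner:
    def __init__(self, engine):
        self._engine = engine

    def run(self, execution, task, action, action_input):
        try:
            result = action.run(action_input)
        except Exception as exc:
            # step 3 (failure path): report ERROR to the engine
            self._engine.task_done(execution, task, ERROR, exc)
            return
        if result is None:
            # step 7/8: delegate action - no results yet, the 3rd
            # party system will call the webhook later
            task.status = RUNNING
        else:
            # step 4: synchronous action - notify the engine now
            self._engine.task_done(execution, task, SUCCESS, result)
```

One caveat with this convention, worth flagging: a synchronous action that legitimately produces no result becomes indistinguishable from a delegate, which may be why the text hedges with "may be sufficient".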
Re: [openstack-dev] [Mistral] How Mistral handling long running delegate tasks
Thanks for the description! The steps here seem very much like what a taskflow engine does (which is good). To connect this to how I think this could work in taskflow:

1. Someone creates tasks/flows describing the work to-be-done (converting a DSL - taskflow tasks/flows/retry[1] objects…)
2. On execute(workflow) the engine creates a new workflow execution, computes the first batch of tasks, creates an executor for those tasks (remote, local…) and executes those tasks.
3. Waits for responses back from the futures (http://docs.python.org/dev/library/concurrent.futures.html) returned from the executor.
4. Receives futures responses (or receives a new response, DELAY, for example), or exceptions…
5. Continues sending out batches of tasks that can still be executed (aka tasks that don't have a dependency on the output of delayed tasks).
6. If any delayed tasks remain after repeating #2-5 as many times as it can, the engine will shut itself down (see http://tinyurl.com/l3x3rrb).
7. On a delayed task finishing, some API/webhook/other (the mechanism imho shouldn't be tied to webhooks, at least not in taskflow, but should be left up to the user of taskflow to decide how to accomplish this) will be/must be responsible for resuming the engine and setting the result for the previously delayed task.
8. Repeat 2 - 7 until all tasks have executed/failed.
9. Profit!

This seems like it could be accomplished, although there are race conditions in #6 (what if multiple delayed requests are received at the same time)? What locking is done to ensure that this doesn't cause conflicts? Does the POC solve that part (no simultaneous step #5 from below)? There was a mention of a watch-dog (ideally to ensure that delayed tasks can't just sit around forever), was that implemented?

[1] https://wiki.openstack.org/wiki/TaskFlow#Retries (new feature!)
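Steps 2-6 above - dispatch a batch via an executor, wait on the futures, and suspend once only delayed tasks remain - might look roughly like this simplified sketch using concurrent.futures. The DELAY sentinel and function names are invented for illustration:

```python
from concurrent.futures import ThreadPoolExecutor

# Sentinel a task returns to say "I'll be finished later" (step 4's
# DELAY response); a real engine would use a richer state object.
DELAY = object()


def run_until_delayed(batches):
    """Run batches of (name, callable) tasks until either everything
    has executed or a delayed task forces the loop to suspend (step 6).
    Returns (completed results, delayed tasks)."""
    delayed, results = [], []
    with ThreadPoolExecutor(max_workers=4) as pool:
        for batch in batches:
            # step 2: send out the batch, collecting futures (step 3)
            futures = {name: pool.submit(fn) for name, fn in batch}
            for name, fut in futures.items():
                out = fut.result()  # step 4: result, DELAY, or raise
                (delayed if out is DELAY else results).append((name, out))
            if delayed:
                # step 6: stop dispatching instead of busy-waiting;
                # a resume call (step 7) would pick up from here
                break
    return results, delayed
```

The race condition Joshua raises lives exactly at the resume boundary: if two delayed results arrive at once, two callers may both try to restart this loop, hence the locking discussion later in the thread.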
From: Dmitri Zimine d...@stackstorm.com Reply-To: OpenStack Development Mailing List (not for usage questions) openstack-dev@lists.openstack.org Date: Thursday, March 27, 2014 at 4:43 PM To: OpenStack Development Mailing List (not for usage questions) openstack-dev@lists.openstack.org Subject: [openstack-dev] [Mistral] How Mistral handling long running delegate tasks

Following up on http://tinyurl.com/l8gtmsw and http://tinyurl.com/n3v9lt8: this explains how Mistral handles long running delegate tasks. Note that a 'passive' workflow engine can handle both normal tasks and delegates the same way. I'll also put that on the ActionDesign wiki, after discussion. Diagram: https://docs.google.com/a/stackstorm.com/drawings/d/147_EpdatpN_sOLQ0LS07SWhaC3N85c95TkKMAeQ_a4c/edit?usp=sharing

1. On start(workflow), the engine creates a new workflow execution, computes the first batch of tasks, and sends them to the ActionRunner [1].
2. The ActionRunner creates an action and calls action.run(input)
3. The Action does the work (compute (10!)), produces the results, and returns the results to the executor. If it returns, status=SUCCESS. If it fails, it throws an exception, status=ERROR.
4. The ActionRunner notifies the Engine that the task is complete: task_done(execution, task, status, results)[2]
5. The Engine computes the next task(s) ready to trigger, according to control flow and data flow, and sends them to the ActionRunner.
6. Like step 2: the ActionRunner calls the action's run(input)
7. A delegate action doesn't produce results: it calls out to the 3rd party system, which is expected to make a callback to a workflow service with the results. It returns to the ActionRunner without results, immediately.
8. The ActionRunner marks status=RUNNING [?]
9. The 3rd party system takes a 'long time' == longer than any system component can be assumed to stay alive.
10.
The 3rd party component calls a Mistral WebHook which resolves to engine.task_complete(workbook, id, status, results)

Comments:
* One Engine handles multiple executions of multiple workflows. It exposes two main operations: start(workflow) and task_complete(execution, task, status, results), and is responsible for defining the next batch of tasks based on control flow and data flow. The Engine is passive - it runs in a host's thread. The Engine and ActionRunner communicate via task queues asynchronously; for details, see https://wiki.openstack.org/wiki/Mistral/POC
* The Engine doesn't distinguish sync and async actions; it doesn't deal with Actions at all. It only reacts to task completions, handling the results, updating the state, and queuing the next set of tasks.
* Only an Action can know and define whether it is a delegate or not. Some protocol is required to let the ActionRunner know that the action is not returning the results immediately. A convention of returning None may be sufficient.
* Mistral exposes engine.task_done in the REST API so 3rd party systems can call a web hook.

DZ.

[1] I use ActionRunner instead of Executor (the current name) to avoid confusion: it is the Engine which is responsible