This is what I refer to as sharding. It can be seen as a special type of fork/join where all shards perform the same actions on different datasets, and the number of shards depends on the number of datasets.
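To make the idea concrete, here is a minimal sketch of sharding expressed as a generated fork/join, along the lines of the XML generation approach Chris mentions further down the thread. It is not the workflow lib capability described in this message. The function name `build_sharded_workflow`, the node names, the `shardInput` property, the schema version and the example paths are all assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# Assumed schema version; adjust to whatever your Oozie server accepts.
WF_NS = "uri:oozie:workflow:0.4"


def build_sharded_workflow(name, shard_inputs, shard_app_path):
    """Return workflow XML with one sub-workflow action per shard input.

    Every shard runs the same sub-workflow (shard_app_path) on a
    different dataset; the number of fork paths equals len(shard_inputs).
    """
    # This sketch only handles the parallel case; a single input
    # would not need a fork/join at all.
    if len(shard_inputs) < 2:
        raise ValueError("need at least two shard inputs to build a fork")

    root = ET.Element("workflow-app", {"name": name, "xmlns": WF_NS})
    ET.SubElement(root, "start", {"to": "shard-fork"})

    # One <path> per shard inside the fork node.
    fork = ET.SubElement(root, "fork", {"name": "shard-fork"})
    for i in range(len(shard_inputs)):
        ET.SubElement(fork, "path", {"start": f"shard-{i}"})

    # One sub-workflow action per shard, all pointing at the same app.
    for i, input_path in enumerate(shard_inputs):
        action = ET.SubElement(root, "action", {"name": f"shard-{i}"})
        sub = ET.SubElement(action, "sub-workflow")
        ET.SubElement(sub, "app-path").text = shard_app_path
        conf = ET.SubElement(sub, "configuration")
        prop = ET.SubElement(conf, "property")
        # "shardInput" is a hypothetical property name read by the sub-workflow.
        ET.SubElement(prop, "name").text = "shardInput"
        ET.SubElement(prop, "value").text = input_path
        ET.SubElement(action, "ok", {"to": "shard-join"})
        ET.SubElement(action, "error", {"to": "fail"})

    # All shards converge on the join; any failure routes to the kill node.
    ET.SubElement(root, "join", {"name": "shard-join", "to": "end"})
    kill = ET.SubElement(root, "kill", {"name": "fail"})
    ET.SubElement(kill, "message").text = "A shard failed"
    ET.SubElement(root, "end", {"name": "end"})
    return ET.tostring(root, encoding="unicode")
```

Calling `build_sharded_workflow("demo", ["/data/a", "/data/b"], "/apps/shard-wf")` returns a workflow definition with two parallel paths. The result would still have to be written to HDFS and submitted, which is exactly the generation step that a built-in sharding construct would make unnecessary.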
A while ago I rewrote the workflow lib, cleaning it up a bit and adding this capability, but the work never got completed. If there is interest, we could create an umbrella JIRA and complete the integration.

Thanks.

On Thu, Feb 20, 2014 at 1:47 PM, Mona Chitnis <[email protected]> wrote:

> If you use the sub-workflow construct, then it would do some error
> reporting for you. If a sub-workflow fails, the parent workflow also gets
> updated to failed. Also in Oozie 4.0, the JIRA OOZIE-1264 (The "parent"
> property of a subworkflow should be the ID of the parent workflow) helps
> get the dependency graph using IDs.
>
>
> On 2/20/14, 12:52 PM, "Heller, Chris" <[email protected]> wrote:
>
> >Mona,
> >
> >Thanks. That is the road I'm headed down at the moment.
> >
> >I'll create a Java action which takes the files (or a path glob -- or
> >something) as input, and create multiple Oozie tasks based on that input,
> >and then 'wait' for those tasks to complete.
> >
> >A feature like this built into the workflow certainly would be nice, since
> >it would better integrate error handling I think.
> >
> >-Chris
> >
> >On 2/20/14, 3:43 PM, "Mona Chitnis" <[email protected]> wrote:
> >
> >>Hi Chris,
> >>
> >>There isn't a way of doing dynamic parallel tasks within the same Oozie
> >>workflow XML currently. But you can do so programmatically. Using the
> >>Oozie Java API, you can start a dynamic number of sub-workflows based on
> >>the number of outputs.
> >>
> >>
> >>On 2/20/14, 7:05 AM, "Heller, Chris" <[email protected]> wrote:
> >>
> >>>Hi,
> >>>
> >>>I'm trying to figure out the best way to implement a workflow in Oozie.
> >>>
> >>>I am creating a workflow which splits an input into multiple outputs.
> >>>
> >>>Then for each output I want to run another process.
> >>>
> >>>The trouble is I cannot know a priori how many outputs I will have, and
> >>>so to post-process each I don't see how to set up a workflow to run the
> >>>next stage.
> >>>
> >>>Ideally the next stage would be a fork/join type of scenario, since each
> >>>output can be processed independently. But there isn't any way I can see
> >>>to set up the fork paths without using some sort of XML generation
> >>>preprocessor.
> >>>
> >>>Does anyone have a suggestion of how to proceed? Am I stuck doing
> >>>workflow generation? Or is there another way to structure this workflow
> >>>using the existing primitives?
> >>>
> >>>Thanks,
> >>>Chris
