If a workflow takes longer than the coordination interval to execute, then the next workflow instance will be created and put into 'Waiting' state by default. There are concurrency settings that allow more than one workflow instance to execute at a time. Since you will be moving data into a processing directory, you can run with concurrency greater than one, as long as the processing directory is unique per workflow instance.
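As a rough sketch of what that looks like (the app name, paths, times, and the processDir property here are placeholders, not from your setup), the relevant knobs live in the coordinator's <controls> block: <concurrency> caps how many instances may run at once, and <execution> sets the order queued instances are dispatched in. Passing the nominal time into the workflow is one way to get a unique processing directory per instance:

```xml
<coordinator-app name="tweet-batch" frequency="${coord:minutes(1)}"
                 start="2014-09-07T00:00Z" end="2015-09-07T00:00Z"
                 timezone="UTC" xmlns="uri:oozie:coordinator:0.2">
  <controls>
    <!-- Allow up to 2 instances to run at once; the default is 1,
         so a still-running instance leaves the next one in 'Waiting' -->
    <concurrency>2</concurrency>
    <!-- Dispatch queued instances oldest-first -->
    <execution>FIFO</execution>
  </controls>
  <action>
    <workflow>
      <app-path>${nameNode}/apps/tweet-batch-wf</app-path>
      <configuration>
        <property>
          <!-- Hypothetical property: derive a per-instance processing
               directory from the coordinator's nominal time -->
          <name>processDir</name>
          <value>${nameNode}/data/processing/${coord:formatTime(coord:nominalTime(), 'yyyyMMddHHmm')}</value>
        </property>
      </configuration>
    </workflow>
  </action>
</coordinator-app>
```

With this shape, two overlapping runs never touch the same processing directory, since each instance's directory is keyed to its own nominal time.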
-----Original Message-----
From: Charles Robertson [mailto:[email protected]]
Sent: Sunday, September 07, 2014 4:25 AM
To: [email protected]
Subject: Re: Best way to trigger oozie workflow?

Hi Paul,

Yes, that definitely helps, thanks for sharing. What I'm thinking might work for me is this:

1. Write all incoming data to a landing directory.
2. Have a workflow that runs every, say, minute that does the following steps:
   (a) moves all completed files to a processing directory
   (b) executes the processing on the data in that directory.

I don't *think* there should be any data duplication, and a small amount of data loss would probably be tolerable. I'd like this to be as near real-time as is feasible, but a small lag (a couple of minutes) is acceptable. I'd expect the data arrival to be lumpy, so nothing for long periods (e.g. overnight), a steady trickle at some times and lots during specific events.

The potential flaw in this is: what happens if the workflow takes longer to execute than the period between executions? Does oozie spawn a second workflow, restart the workflow as soon as the first finishes, or just skip execution until the next period? If the first, would using a trigger file work as a control element? E.g. first check if the trigger file exists - if not, stop the workflow; if it does, delete it and execute the workflow, creating a new trigger file as the final step?

Thanks,
Charles

On 6 September 2014 20:09, Paul Chavez <[email protected]> wrote:
> I can offer some insights based on my experiences processing flume
> sourced data with oozie.
>
> You are correct in that if you try to use data dependencies and flume
> event timestamp bucketing your workflows will trigger as soon as the
> first event in that partition arrives. Our initial architecture would
> write to hourly buckets and the coordinator was configured to
> materialize an instance of the dataset for the *previous* hour.
> It worked fairly well as long as everything was up and the latency
> from event generation to written in HDFS was small. There was also
> the assumption that data never arrived out of order, and that when a
> given hourly directory was created no more data would arrive in the
> one being processed. We would reprocess the entire dataset again
> every night to catch missing data. Ultimately, this approach was
> abandoned as it did not handle the existence of duplicate data well
> and was sometimes difficult to get re-started after any significant
> gap in data written to HDFS. I included it here because it did work
> for the most part and is easy to set up.
>
> What we do now is write continuously to an 'incoming' directory for
> each unique data stream. For a given incoming directory, a
> coordinator will execute a workflow every 15 minutes or hour. The
> workflow will first copy any completed files out of 'incoming' into a
> 'to_process' directory and use that as an input to a deduplication
> workflow. Whereas before each bucket corresponded to actual event
> generation time, they now correspond to the time the deduplication
> batch was processed. Depending on the data, we have a number of
> downstream coordinators that use dataset dependencies based on the
> output of the dedupe.
>
> Hope that helps,
> Paul Chavez
>
>
> -----Original Message-----
> From: Charles Robertson [mailto:[email protected]]
> Sent: Saturday, September 06, 2014 10:39 AM
> To: [email protected]
> Subject: Best way to trigger oozie workflow?
>
> Hi all,
>
> I'm using flume to collect tweets, and I want to process the files
> generated by flume as soon as possible after they arrive.
>
> What is the best way to achieve this?
>
> This is the best explanation of the different ways I have seen so far:
> https://github.com/yahoo/oozie/wiki/Oozie-Coord-Use-Cases
>
> Flume can generate data directories (based on e.g. hour, minute, etc)
> but my reading is that oozie will try and process it the moment the
> directory appears. I'm not sure basing it on the files appearing
> would work any better, either (unless it's possible to use wild cards
> in the file name?)
>
> It's also quite possible more data will arrive while the workflow is
> executing, so that needs to be handled appropriately without skipping
> or re-processing data.
>
> Any advice or links to tutorials/blogs/guides would be appreciated!
>
> Thanks,
> Charles
>
