I can offer some insights based on my experience processing flume-sourced data with oozie.

You are correct that if you use data dependencies with flume's event-timestamp bucketing, your workflows will trigger as soon as the first event in a partition arrives. Our initial architecture wrote to hourly buckets, and the coordinator was configured to materialize an instance of the dataset for the *previous* hour. It worked fairly well as long as everything was up and the latency from event generation to the write in HDFS was small. It also assumed that data never arrived out of order and that once a given hourly directory was created, no more data would arrive in the one being processed. We reprocessed the entire dataset every night to catch missing data. Ultimately we abandoned this approach because it did not handle duplicate data well and was sometimes difficult to restart after any significant gap in data written to HDFS. I include it here because it did work for the most part and is easy to set up.
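In case it's useful, here is a minimal sketch of that first setup. The names, paths, and dates are illustrative, not our actual config:

<coordinator-app name="hourly-tweets-coord" frequency="${coord:hours(1)}"
                 start="2014-09-01T00:00Z" end="2015-09-01T00:00Z" timezone="UTC"
                 xmlns="uri:oozie:coordinator:0.4">
  <datasets>
    <!-- One instance per hour, matching the flume timestamp bucketing -->
    <dataset name="tweets" frequency="${coord:hours(1)}"
             initial-instance="2014-09-01T00:00Z" timezone="UTC">
      <uri-template>${nameNode}/data/tweets/${YEAR}/${MONTH}/${DAY}/${HOUR}</uri-template>
      <!-- Empty done-flag: the directory's existence alone satisfies the
           dependency, hence the "fires on the first event" behavior -->
      <done-flag></done-flag>
    </dataset>
  </datasets>
  <input-events>
    <!-- Process the previous hour's bucket, not the one still being written -->
    <data-in name="input" dataset="tweets">
      <instance>${coord:current(-1)}</instance>
    </data-in>
  </input-events>
  <action>
    <workflow>
      <app-path>${nameNode}/apps/process-tweets</app-path>
      <configuration>
        <property>
          <name>inputDir</name>
          <value>${coord:dataIn('input')}</value>
        </property>
      </configuration>
    </workflow>
  </action>
</coordinator-app>

Note that without a done-flag element oozie waits for a _SUCCESS file, which flume never writes, so with this layout you are effectively forced into the empty done-flag (directory existence) semantics and all the caveats above.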
What we do now is write continuously to an 'incoming' directory for each unique data stream. For a given incoming directory, a coordinator executes a workflow every 15 minutes or every hour. The workflow first copies any completed files out of 'incoming' into a 'to_process' directory, then uses that directory as the input to a deduplication workflow. Whereas before each bucket corresponded to actual event-generation time, buckets now correspond to the time the deduplication batch was processed. Depending on the data, we have a number of downstream coordinators that use dataset dependencies based on the output of the dedupe; a sketch of the staging step follows.
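Here is a minimal sketch of that staging step. The workflow name, paths, and shell script are illustrative assumptions, and I've shown the copy as a move so a file is only ever claimed by one batch. By default flume's HDFS sink gives in-flight files a .tmp suffix, which is how "completed" can be detected:

<workflow-app name="dedupe-tweets-wf" xmlns="uri:oozie:workflow:0.4">
  <start to="stage-files"/>

  <!-- Claim a batch: move completed files from 'incoming' into a
       'to_process' directory unique to this run. Files still being
       written carry flume's default .tmp in-use suffix and are left
       behind for the next batch. -->
  <action name="stage-files">
    <shell xmlns="uri:oozie:shell-action:0.2">
      <job-tracker>${jobTracker}</job-tracker>
      <name-node>${nameNode}</name-node>
      <exec>stage_files.sh</exec>
      <argument>/data/tweets/incoming</argument>
      <argument>/data/tweets/to_process/${wf:id()}</argument>
      <file>stage_files.sh</file>
    </shell>
    <!-- The real workflow continues to the dedupe action from here -->
    <ok to="end"/>
    <error to="fail"/>
  </action>

  <kill name="fail">
    <message>Staging failed: ${wf:errorMessage(wf:lastErrorNode())}</message>
  </kill>
  <end name="end"/>
</workflow-app>

stage_files.sh only needs to create the target directory and 'hadoop fs -mv' every file in the source that does not end in .tmp. The coordinator driving this workflow is purely time-triggered (frequency ${coord:minutes(15)}, no input-events), so it never stalls waiting on a dataset; a quiet period just yields an empty batch.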
Hope that helps,
Paul Chavez

-----Original Message-----
From: Charles Robertson [mailto:[email protected]]
Sent: Saturday, September 06, 2014 10:39 AM
To: [email protected]
Subject: Best way to trigger oozie workflow?

Hi all,

I'm using flume to collect tweets, and I want to process the files generated by flume as soon as possible after they arrive. What is the best way to achieve this?

This is the best explanation of the different ways I have seen so far:
https://github.com/yahoo/oozie/wiki/Oozie-Coord-Use-Cases

Flume can generate data directories (based on e.g. hour, minute, etc.) but my reading is that oozie will try to process them the moment the directory appears. I'm not sure basing it on the files appearing would work any better, either (unless it's possible to use wildcards in the file name?).

It's also quite possible more data will arrive while the workflow is executing, so that needs to be handled appropriately without skipping or re-processing data.

Any advice or links to tutorials/blogs/guides would be appreciated!

Thanks,
Charles