Hi Paul,

Yes, that definitely helps, thanks for sharing.
What I'm thinking might work for me is this:

1. Write all incoming data to a landing directory.
2. Have a workflow that runs every, say, minute that does the following steps:
   (a) moves all completed files to a processing directory
   (b) executes the processing on the data in that directory.

I don't *think* there should be any data duplication, and a small amount of
data loss would probably be tolerable. I'd like this to be as near real-time
as is feasible, but a small lag (a couple of minutes) is acceptable. I'd
expect the data arrival to be lumpy, so nothing for long periods (e.g.
overnight), a steady trickle at some times and lots during specific events.

The potential flaw in this is: what happens if the workflow takes longer to
execute than the period between executions? Does oozie spawn a second
workflow, restart the workflow as soon as the first finishes, or just skip
execution until the next period?

If the first, would using a trigger file work as a control element? e.g.
first check if the trigger file exists - if not, stop the workflow; if it
does, delete it and execute the workflow, creating a new trigger file as the
final step?

(To make this concrete, I've pasted a rough sketch of the coordinator and
workflow I'm imagining at the bottom of this mail, below the quoted thread.)

Thanks,
Charles

On 6 September 2014 20:09, Paul Chavez <[email protected]> wrote:

> I can offer some insights based on my experiences processing flume sourced
> data with oozie.
>
> You are correct in that if you try to use data dependencies and flume
> event timestamp bucketing your workflows will trigger as soon as the first
> event in that partition arrives. Our initial architecture would write to
> hourly buckets and the coordinator was configured to materialize an
> instance of the dataset for the *previous* hour. It worked fairly well as
> long as everything was up and the latency from event generation to written
> in HDFS was small. There was also the assumption that data never arrived
> out of order and that when a given hourly directory was created no more
> data would arrive in the one being processed. We would reprocess the entire
> dataset again every night to catch missing data. Ultimately, this approach
> was abandoned as it did not handle the existence of duplicate data well and
> was sometimes difficult to get re-started after any significant gap in data
> written to HDFS. I included it here because it did work for the most part
> and is easy to setup.
>
> What we do now is write continuously to an 'incoming' directory for each
> unique data stream. For a given incoming directory a coordinator will
> execute a workflow every 15 minutes or hour. The workflow will first copy
> any completed files out of 'incoming' into a 'to_process' directory and use
> that as an input to a deduplication workflow. Whereas before each bucket
> corresponded to actual event generation time, they now correspond to the
> time the deduplication batch was processed. Depending on the data we have a
> number of downstream coordinators that use dataset dependencies based on
> the output of the dedupe.
>
> Hope that helps,
> Paul Chavez
>
>
>
> -----Original Message-----
> From: Charles Robertson [mailto:[email protected]]
> Sent: Saturday, September 06, 2014 10:39 AM
> To: [email protected]
> Subject: Best way to trigger oozie workflow?
>
> Hi all,
>
> I'm using flume to collect tweets, and I want to process the files
> generated by flume as soon as possible after they arrive.
>
> What is the best way to achieve this?
>
> This is the best explanation of the different ways I have seen so far:
> https://github.com/yahoo/oozie/wiki/Oozie-Coord-Use-Cases
>
> Flume can generate data directories (based on e.g. hour, minute, etc) but
> my reading is that oozie will try and process it the moment the directory
> appears. I'm not sure basing it on the files appearing would work any
> better, either (unless it's possible to use wild cards in the file name?)
>
> It's also quite possible more data will arrive while the workflow is
> executing, so that needs to be handled appropriately without skipping or
> re-processing data.
>
> Any advice or links to tutorials/blogs/guides would be appreciated!
>
> Thanks,
> Charles
>
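
Here is the coordinator sketch I mentioned above. Everything in it - the
name, the 5-minute frequency, the dates and all of the paths - is a
placeholder I've made up. If I've read the docs right, the <controls> block
is where the "run takes longer than the period" behaviour gets decided, so
that is really the part I'm asking about:

<!-- sketch only: name, frequency, dates and paths below are made-up placeholders -->
<coordinator-app name="tweet-processing-coord"
                 frequency="${coord:minutes(5)}"
                 start="2014-09-08T00:00Z" end="2015-09-08T00:00Z"
                 timezone="UTC"
                 xmlns="uri:oozie:coordinator:0.4">

  <controls>
    <!-- only one instance of the workflow at a time; an overrunning instance
         should delay the next one rather than run alongside it -->
    <concurrency>1</concurrency>
    <!-- with FIFO, delayed instances queue up and run oldest-first;
         LAST_ONLY would skip the backlog and only run the newest -->
    <execution>FIFO</execution>
  </controls>

  <action>
    <workflow>
      <app-path>${nameNode}/apps/tweet-processing</app-path>
      <configuration>
        <property>
          <name>landingDir</name>
          <value>${nameNode}/data/tweets/landing</value>
        </property>
        <property>
          <name>processingDir</name>
          <value>${nameNode}/data/tweets/processing</value>
        </property>
        <property>
          <name>triggerFile</name>
          <value>${nameNode}/data/tweets/_trigger</value>
        </property>
      </configuration>
    </workflow>
  </action>
</coordinator-app>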

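And this is roughly how I'm picturing the trigger-file control inside the
workflow itself: a decision node on fs:exists(), an fs action that deletes
the trigger and stages whatever has landed, and a touchz at the very end to
re-create the trigger. The property names (triggerFile, landingDir,
processingDir) are the made-up ones from the coordinator above, the actual
processing step is only indicated by a comment, and I'm assuming the fs
action commands and EL functions behave the way I think they do:

<!-- sketch only: all names and paths are the placeholders from the coordinator above -->
<workflow-app name="tweet-processing" xmlns="uri:oozie:workflow:0.5">

  <start to="check-trigger"/>

  <!-- if the trigger file is missing, assume a previous run is still busy
       (or failed) and stop here until the next coordinator instance -->
  <decision name="check-trigger">
    <switch>
      <case to="stage-files">${fs:exists(triggerFile)}</case>
      <default to="end"/>
    </switch>
  </decision>

  <!-- delete the trigger, move everything that has landed so far into a
       run-specific processing directory, then re-create an empty landing
       directory for new arrivals. (This moves the whole directory, in-flight
       .tmp files included; leaving flume's in-progress files behind would
       presumably need a shell action with a glob instead.) -->
  <action name="stage-files">
    <fs>
      <delete path="${triggerFile}"/>
      <move source="${landingDir}" target="${processingDir}/${wf:id()}"/>
      <mkdir path="${landingDir}"/>
    </fs>
    <!-- the real processing action(s) over ${processingDir}/${wf:id()} would
         slot in between this action and recreate-trigger -->
    <ok to="recreate-trigger"/>
    <error to="fail"/>
  </action>

  <!-- re-create the trigger as the final step so the next run can proceed -->
  <action name="recreate-trigger">
    <fs>
      <touchz path="${triggerFile}"/>
    </fs>
    <ok to="end"/>
    <error to="fail"/>
  </action>

  <kill name="fail">
    <message>Tweet processing failed: ${wf:errorMessage(wf:lastErrorNode())}</message>
  </kill>

  <end name="end"/>
</workflow-app>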