I can offer some insights based on my experience processing Flume-sourced
data with Oozie.

You are correct that if you use data dependencies with Flume event-timestamp
bucketing, your workflows will trigger as soon as the first event in that
partition arrives. Our initial architecture would write to hourly buckets, and
the coordinator was configured to materialize an instance of the dataset for
the *previous* hour. It worked fairly well as long as everything was up and
the latency from event generation to being written to HDFS was small. There
was also the assumption that data never arrived out of order, and that once a
given hourly directory was created no more data would arrive in the one being
processed. We would reprocess the entire dataset every night to catch missing
data. Ultimately, this approach was abandoned because it did not handle
duplicate data well and was sometimes difficult to restart after any
significant gap in data written to HDFS. I include it here because it did work
for the most part and is easy to set up.
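For reference, a minimal sketch of what that first setup looked like. All
names, paths, and dates here are illustrative, not our production config, and
${nameNode} is assumed to be defined in the job properties:

<coordinator-app name="hourly-tweets" frequency="${coord:hours(1)}"
                 start="2014-09-01T00:00Z" end="2015-01-01T00:00Z"
                 timezone="UTC" xmlns="uri:oozie:coordinator:0.4">
  <datasets>
    <!-- one dataset instance per hourly Flume bucket -->
    <dataset name="raw-events" frequency="${coord:hours(1)}"
             initial-instance="2014-09-01T00:00Z" timezone="UTC">
      <uri-template>${nameNode}/data/raw/${YEAR}/${MONTH}/${DAY}/${HOUR}</uri-template>
      <!-- empty done-flag: the instance counts as available as soon as
           the directory exists, i.e. when the first event lands -->
      <done-flag></done-flag>
    </dataset>
  </datasets>
  <input-events>
    <!-- depend on the *previous* hour's bucket, not the current one -->
    <data-in name="input" dataset="raw-events">
      <instance>${coord:current(-1)}</instance>
    </data-in>
  </input-events>
  <action>
    <workflow>
      <app-path>${nameNode}/apps/process-raw</app-path>
    </workflow>
  </action>
</coordinator-app>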

What we do now is write continuously to an 'incoming' directory for each
unique data stream. For a given incoming directory, a coordinator executes a
workflow every 15 minutes or every hour. The workflow first copies any
completed files out of 'incoming' into a 'to_process' directory and uses that
as the input to a deduplication workflow. Whereas before each bucket
corresponded to the actual event-generation time, buckets now correspond to
the time the deduplication batch was processed. Depending on the data, we have
a number of downstream coordinators that use dataset dependencies based on the
output of the dedupe.
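As a rough sketch, the time-triggered side of that looks something like the
following. Again, the names and paths are illustrative; coord:formatTime is
used to stamp each batch directory with its processing time rather than the
event time:

<coordinator-app name="dedupe-tweets" frequency="${coord:minutes(15)}"
                 start="2014-09-01T00:00Z" end="2015-01-01T00:00Z"
                 timezone="UTC" xmlns="uri:oozie:coordinator:0.4">
  <!-- purely time-triggered: no input-events, so it fires every
       15 minutes regardless of what has landed in HDFS -->
  <action>
    <workflow>
      <app-path>${nameNode}/apps/dedupe-tweets</app-path>
      <configuration>
        <property>
          <name>incomingDir</name>
          <value>${nameNode}/data/tweets/incoming</value>
        </property>
        <property>
          <!-- batch output keyed by processing time, not event time -->
          <name>toProcessDir</name>
          <value>${nameNode}/data/tweets/to_process/${coord:formatTime(coord:nominalTime(), 'yyyyMMddHHmm')}</value>
        </property>
      </configuration>
    </workflow>
  </action>
</coordinator-app>

The first action inside that workflow does the copy from incomingDir into
toProcessDir, skipping any files Flume still has open (its in-flight .tmp
files), so the dedupe only ever sees completed files. Downstream coordinators
then declare their datasets against the dedupe output directories.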

Hope that helps,
Paul Chavez



-----Original Message-----
From: Charles Robertson [mailto:[email protected]] 
Sent: Saturday, September 06, 2014 10:39 AM
To: [email protected]
Subject: Best way to trigger oozie workflow?

Hi all,

I'm using Flume to collect tweets, and I want to process the files generated
by Flume as soon as possible after they arrive.

What is the best way to achieve this?

This is the best explanation I have seen so far of the different approaches:
https://github.com/yahoo/oozie/wiki/Oozie-Coord-Use-Cases

Flume can generate data directories (based on e.g. hour, minute, etc.), but my
reading is that Oozie will try to process them the moment the directory
appears. I'm not sure basing it on the files appearing would work any better,
either (unless it's possible to use wildcards in the file name?).

It's also quite possible that more data will arrive while the workflow is
executing, so that needs to be handled appropriately, without skipping or
reprocessing data.

Any advice or links to tutorials/blogs/guides would be appreciated!

Thanks,
Charles
