Hi Paul,

Yes, that definitely helps, thanks for sharing.

What I'm thinking might work for me is this:
1. Write all incoming data to a landing directory.
2. Have a workflow that runs frequently (say, every minute) and does the
following steps:
(a) moves all completed files to a processing directory
(b) executes the processing on the data in that directory (roughly as
sketched below).
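
For concreteness, here's roughly how I picture step 2 as an Oozie
workflow. Just a sketch: the paths, the "FlumeData" prefix and ".tmp"
in-use suffix (Flume's defaults, as far as I know), and the Pig script
are placeholders, and it assumes the hdfs client is available wherever
the shell action runs.

<workflow-app name="tweet-batch-wf" xmlns="uri:oozie:workflow:0.4">
    <start to="stage-files"/>

    <!-- (a) move completed files from the landing dir to processing -->
    <action name="stage-files">
        <shell xmlns="uri:oozie:shell-action:0.2">
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <exec>hdfs</exec>
            <argument>dfs</argument>
            <argument>-mv</argument>
            <!-- completed Flume files end in a digit; in-flight ones
                 still carry the .tmp suffix, so this glob skips them -->
            <argument>/data/tweets/landing/FlumeData.*[0-9]</argument>
            <argument>/data/tweets/processing/</argument>
        </shell>
        <ok to="process-data"/>
        <error to="fail"/>
    </action>

    <!-- (b) the actual processing over the processing dir -->
    <action name="process-data">
        <pig>
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <script>process_tweets.pig</script>
            <param>input=/data/tweets/processing</param>
        </pig>
        <ok to="end"/>
        <error to="fail"/>
    </action>

    <kill name="fail">
        <message>${wf:errorMessage(wf:lastErrorNode())}</message>
    </kill>
    <end name="end"/>
</workflow-app>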

I don't *think* there should be any data duplication, and a small amount
of data loss would probably be tolerable. I'd like this to be as near
real-time as is feasible, but a small lag (a couple of minutes) is
acceptable. I'd expect data arrival to be lumpy: nothing for long periods
(e.g. overnight), a steady trickle at some times, and heavy bursts during
specific events.

The potential flaw in this is: what happens if the workflow takes longer to
execute than the period between executions? Does Oozie spawn a second
workflow, restart the workflow as soon as the first finishes, or just skip
execution until the next period? If the first, would using a trigger file
work as a control element? e.g. first check if the trigger file exists - if
not, stop the workflow; if it does, delete it and execute the workflow,
creating a new trigger file as the final step?
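
The relevant pieces of the workflow might look something like this (again
only a sketch; triggerPath would be a workflow property holding the full
HDFS path of the trigger file, e.g. somewhere under the data directory):

<decision name="check-trigger">
    <switch>
        <!-- no trigger file means the previous run hasn't finished -->
        <case to="end">${fs:exists(triggerPath) eq false}</case>
        <default to="delete-trigger"/>
    </switch>
</decision>

<action name="delete-trigger">
    <fs>
        <delete path="${triggerPath}"/>
    </fs>
    <ok to="stage-files"/>
    <error to="fail"/>
</action>

<!-- ... stage-files and process-data actions as in the sketch above ... -->

<action name="recreate-trigger">
    <fs>
        <touchz path="${triggerPath}"/>
    </fs>
    <ok to="end"/>
    <error to="fail"/>
</action>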

Thanks,
Charles


On 6 September 2014 20:09, Paul Chavez <[email protected]> wrote:

> I can offer some insights based on my experiences processing Flume-sourced
> data with Oozie.
>
> You are correct that if you use data dependencies with Flume
> event-timestamp bucketing, your workflows will trigger as soon as the
> first event in that partition arrives. Our initial architecture would
> write to hourly buckets, and the coordinator was configured to
> materialize an instance of the dataset for the *previous* hour. It worked
> fairly well as long as everything was up and the latency from event
> generation to being written to HDFS was small. There was also the
> assumption that data never arrived out of order, and that once a given
> hourly directory was created no more data would arrive in the one being
> processed. We would reprocess the entire dataset every night to catch
> missing data. Ultimately, this approach was abandoned because it did not
> handle duplicate data well and was sometimes difficult to restart after
> any significant gap in data written to HDFS. I included it here because
> it did work for the most part and is easy to set up.
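>
> To make that concrete, the coordinator for that first approach looked
> roughly like the following (an illustrative reconstruction, not our real
> names and paths):
>
> <coordinator-app name="tweets-hourly-coord" frequency="${coord:hours(1)}"
>                  start="${startTime}" end="${endTime}" timezone="UTC"
>                  xmlns="uri:oozie:coordinator:0.4">
>     <datasets>
>         <!-- one instance per hourly Flume bucket -->
>         <dataset name="events" frequency="${coord:hours(1)}"
>                  initial-instance="${startTime}" timezone="UTC">
>             <uri-template>${nameNode}/data/tweets/${YEAR}/${MONTH}/${DAY}/${HOUR}</uri-template>
>             <!-- empty done-flag: the directory only has to exist -->
>             <done-flag></done-flag>
>         </dataset>
>     </datasets>
>     <input-events>
>         <!-- run against the *previous* hour's bucket -->
>         <data-in name="input" dataset="events">
>             <instance>${coord:current(-1)}</instance>
>         </data-in>
>     </input-events>
>     <action>
>         <workflow>
>             <app-path>${nameNode}/apps/tweets-hourly-wf</app-path>
>             <configuration>
>                 <property>
>                     <name>inputDir</name>
>                     <value>${coord:dataIn('input')}</value>
>                 </property>
>             </configuration>
>         </workflow>
>     </action>
> </coordinator-app>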
>
> What we do now is write continuously to an 'incoming' directory for each
> unique data stream. For a given incoming directory, a coordinator will
> execute a workflow every 15 minutes or every hour. The workflow first
> copies any completed files out of 'incoming' into a 'to_process'
> directory and uses that as the input to a deduplication workflow.
> Whereas before each bucket corresponded to the actual event-generation
> time, the buckets now correspond to the time the deduplication batch was
> processed. Depending on the data, we have a number of downstream
> coordinators that use dataset dependencies based on the output of the
> dedupe.
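>
> The trigger for this setup is purely time based, something along these
> lines (again illustrative; the controls block is one way to keep runs
> from stacking up if a batch overruns its slot):
>
> <coordinator-app name="tweets-dedupe-coord" frequency="${coord:minutes(15)}"
>                  start="${startTime}" end="${endTime}" timezone="UTC"
>                  xmlns="uri:oozie:coordinator:0.4">
>     <controls>
>         <!-- at most one batch at a time, oldest first -->
>         <concurrency>1</concurrency>
>         <execution>FIFO</execution>
>     </controls>
>     <action>
>         <workflow>
>             <app-path>${nameNode}/apps/tweets-dedupe-wf</app-path>
>         </workflow>
>     </action>
> </coordinator-app>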
>
> Hope that helps,
> Paul Chavez
>
>
>
> -----Original Message-----
> From: Charles Robertson [mailto:[email protected]]
> Sent: Saturday, September 06, 2014 10:39 AM
> To: [email protected]
> Subject: Best way to trigger oozie workflow?
>
> Hi all,
>
> I'm using Flume to collect tweets, and I want to process the files
> generated by Flume as soon as possible after they arrive.
>
> What is the best way to achieve this?
>
> This is the best explanation of the different ways I have seen so far:
> https://github.com/yahoo/oozie/wiki/Oozie-Coord-Use-Cases
>
> Flume can generate data directories (based on e.g. hour, minute, etc.),
> but my reading is that Oozie will try to process them the moment the
> directory appears. I'm not sure basing it on the files appearing would
> work any better, either (unless it's possible to use wildcards in the
> file name?).
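>
> (For reference, by 'data directories' I mean time-bucketed paths that an
> Oozie dataset would point at, roughly like this; the paths are made up.)
>
> <dataset name="tweets" frequency="${coord:hours(1)}"
>          initial-instance="2014-09-01T00:00Z" timezone="UTC">
>     <uri-template>${nameNode}/flume/tweets/${YEAR}/${MONTH}/${DAY}/${HOUR}</uri-template>
> </dataset>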
>
> It's also quite possible more data will arrive while the workflow is
> executing, so that needs to be handled appropriately without skipping or
> re-processing data.
>
> Any advice or links to tutorials/blogs/guides would be appreciated!
>
> Thanks,
> Charles
>
