Whatever Idris said is correct.
I know Alain did some work for similar problem. He may give some more details.
Regards,Mohammad
On Thursday, May 14, 2015 9:32 PM, Idris Ali <[email protected]> wrote:
Hi Andrew,
If I understand correctly, it is about late data processing of events.
AFAIK there is no easy way to achieve this directly in oozie. Best you can
delay the processing by specifying
<start-instance>${coord:current(-5)}</start-instance>
assuming your data may arrive/change upto 5 hours.
However, Apache Falcon has already solved these kind of feed processing
problem.
Check Late data section in
http://falcon.apache.org/0.6-incubating/EntitySpecification.html
Thanks,
-Idris
On Fri, May 15, 2015 at 2:41 AM, Andrew Mains <[email protected]>
wrote:
> Hi all,
>
> I'm hoping to get some input on a redesign of the way we run our data
> pipeline with oozie. We have a use case where we frequently get delayed
> data after we've processed a particular time window--that is, we can run a
> workflow on a given hour of data, receive new input for that hour, and then
> need to reprocess the hour. To give a more concrete example, say we have a
> coordinator application with inputs and data sets:
>
> <datasets>
> <dataset name="input1" frequency="60"
> initial-instance="2015-05-14T19:00Z" timezone="UTC">
>
> <uri-template>${hdfs}/tmp/revenue_feed/${YEAR}/${MONTH}/${DAY}/${HOUR}</uri-template>
> </dataset>
> </datasets>
> <input-events>
> <data-in name="coordInput1" dataset="input1">
> <start-instance>${coord:current(-1)}</start-instance>
> <end-instance>${coord:current(0)}</end-instance>
> </data-in>
> </input-events>
>
> If the coordinator system sees a done-flag for 2015-05-14T19:00:00 in
> revenue_feed, it will kick off its job. However, if new data comes in to
> revenue_feed, it won't kick off another job to handle it (afaik). As a
> result, the downstream datasets from this coordinator will remain out of
> date.
>
> Does oozie provide any means for handling this kind of scenario? As far as
> I can tell, for a given coordinator, once an hour is processed, it remains
> processed, and the coordinator system won't rerun it, even if new input
> data comes in--is that understanding correct?
>
> Thank you very much for your help!
>
> Andrew
>