Hi Andrew,

We handle a very similar use case in our ETL pipelines. Our data is partitioned daily in HDFS, but the data files can arrive with lags of several hours. We have split our Oozie workflows in two: one backfills using a deterministic *daily* dataset view of the data, and one in which each *hourly* iteration tails HDFS for files newer than a checkpoint timestamp. The tailing is done with a custom class we wrote.
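Roughly, the selection logic looks like this (an illustrative Python sketch, not our actual class — the function name, parameters, and defaults here are made up for the example):

```python
from datetime import datetime, timedelta

def select_new_files(listing, checkpoint, lookback_days=6, now=None):
    """Pick files modified after `checkpoint`, but no older than a capped
    lookback window, so late-arriving files are still found without
    scanning all of history.

    `listing` is an iterable of (path, mtime) pairs, e.g. built from an
    HDFS directory listing.
    """
    now = now or datetime.utcnow()
    window_start = now - timedelta(days=lookback_days)
    return [
        path for path, mtime in listing
        if mtime > checkpoint and mtime >= window_start
    ]
```

Each hourly iteration then persists the maximum mtime it processed as the new checkpoint for the next run.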
Because files can be indexed to prior dates near the day boundary, our tailing code is configured to look back a fixed number of time units (e.g., 6 days), so even with significant lag we still pick up files that show up late. The capped lookback period keeps the HDFS search space bounded, since there is a maximum number of files generated for any given day.

Hope that helps,
Alain

On Fri, May 15, 2015 at 3:56 PM, Mohammad Islam <[email protected]> wrote:

> What Idris said is correct.
> I know Alain did some work on a similar problem. He may give some more
> details.
> Regards,
> Mohammad
>
>
> On Thursday, May 14, 2015 9:32 PM, Idris Ali <[email protected]>
> wrote:
>
> Hi Andrew,
>
> If I understand correctly, this is about processing late-arriving event
> data. AFAIK there is no easy way to achieve this directly in Oozie. The
> best you can do is delay the processing by specifying
> <start-instance>${coord:current(-5)}</start-instance>
> assuming your data may arrive/change up to 5 hours late.
>
> However, Apache Falcon has already solved this kind of feed-processing
> problem.
> Check the "Late data" section in
> http://falcon.apache.org/0.6-incubating/EntitySpecification.html
>
> Thanks,
> -Idris
>
>
> On Fri, May 15, 2015 at 2:41 AM, Andrew Mains <[email protected]>
> wrote:
>
> > Hi all,
> >
> > I'm hoping to get some input on a redesign of the way we run our data
> > pipeline with Oozie. We have a use case where we frequently get delayed
> > data after we've processed a particular time window--that is, we can
> > run a workflow on a given hour of data, receive new input for that
> > hour, and then need to reprocess the hour.
> > To give a more concrete example, say we have a coordinator application
> > with inputs and datasets:
> >
> > <datasets>
> >   <dataset name="input1" frequency="60"
> >            initial-instance="2015-05-14T19:00Z" timezone="UTC">
> >     <uri-template>${hdfs}/tmp/revenue_feed/${YEAR}/${MONTH}/${DAY}/${HOUR}</uri-template>
> >   </dataset>
> > </datasets>
> > <input-events>
> >   <data-in name="coordInput1" dataset="input1">
> >     <start-instance>${coord:current(-1)}</start-instance>
> >     <end-instance>${coord:current(0)}</end-instance>
> >   </data-in>
> > </input-events>
> >
> > If the coordinator system sees a done-flag for 2015-05-14T19:00:00 in
> > revenue_feed, it will kick off its job. However, if new data comes in
> > to revenue_feed, it won't kick off another job to handle it (AFAIK).
> > As a result, the datasets downstream of this coordinator will remain
> > out of date.
> >
> > Does Oozie provide any means of handling this kind of scenario? As far
> > as I can tell, for a given coordinator, once an hour is processed it
> > remains processed, and the coordinator system won't rerun it even if
> > new input data comes in--is that understanding correct?
> >
> > Thank you very much for your help!
> >
> > Andrew
