Hi, Oozie currently supports data-trigger where ALL dependent data-sets are available. There is no way of specifying that if M out of N are available.
This could be a new feature. In that case, can you define a generalized way of defining this? As a possible work around, missing data producer could create a file with length 0. Regards, Mohammad ________________________________ From: lance <[email protected]> To: [email protected] Sent: Wednesday, May 8, 2013 4:20 PM Subject: Gaps in dataset prevents dependency Would like a more verbose way to define the input-event - and the data-in. I'm trying to find ways to handle when a start/end instance isn't satisfied, event thought there is data to be processed. An example of this is when I'm parsing a set of 24 hours of logs, and there may be an hour at night that doesn't have anything produced. This use case is exacerbated when we are talking minutes and doing hourly rollups - but same scenario. Here is the example config: The coordinator runs every 5 minutes: <coordinator-app name="cef-workflow-coordinator" frequency="${coord:minutes(5)}" start="${jobStart}" end="${jobEnd}" In this case the input dataset is produced in minutes: <dataset name="logs" frequency="${coord:minutes(1)}" initial-instance="${initialDataset}" timezone="America/Los_Angeles"> <uri-template>${rootPath}/${YEAR}/${MONTH}/${DAY}/${HOUR}/${MINUTE}</uri-template> <done-flag></done-flag> </dataset> Using the example from the twitter/flume CDH example, the actual indicator that the job should be executed is when a new directory is created that is in the next set of data: <input-events> <data-in name="input" dataset="logs"> <start-instance>${coord:current((coord:tzOffset())-5)}</start-instance> <end-instance>${coord:current(coord:tzOffset())}</end-instance> <!-- <instance>${coord:current(coord:tzOffset())}</instance> --> </data-in> <data-in name="readyIndicator" dataset="logs"> <instance>${coord:current(1 + (coord:tzOffset()))}</instance> </data-in> </input-events> Would like it to be that a directory got created that is in the future from this dataset (the trigger), and then take whatever is available in the last 5 datasets (minutes). Instead, if there is a gap in the previous 5 minutes (say no logs came in during minute t-3) then the dependency is never fulfilled. Thanks for the help
