I think the generalization is to separate the trigger and the dataset.
I.e. the trigger can stay just how it is today - data files, existence of
directories - but the dataset could be defined separately as all data within a
range (or however it wants to be defined). In that case there is no strict
fulfillment requirement: the job takes whatever data currently exists within
the range, for instance.
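
To make that concrete, here is a rough sketch of what I mean. This is purely
hypothetical syntax - the data-trigger element and the fulfillment attribute
don't exist in Oozie today - I'm just illustrating the separation:

    <input-events>
        <!-- Hypothetical: fires as soon as the next instance's directory appears -->
        <data-trigger dataset="logs">
            <instance>${coord:current(1)}</instance>
        </data-trigger>
        <!-- Hypothetical: passes along whatever instances exist in the range,
             without requiring every one of them to be present -->
        <data-in name="input" dataset="logs" fulfillment="available">
            <start-instance>${coord:current(-4)}</start-instance>
            <end-instance>${coord:current(0)}</end-instance>
        </data-in>
    </input-events>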


As far as the missing data producer creating a 0-length file goes - this is
Flume - so I would think this is a pretty standard use case. Even the Twitter
example in CDH uses Flume to pull tweets into an Oozie workflow, yet says they
had to resort to a hack to trigger the coordinator because there was no good
way to do it in Flume. I would like to see a better way of doing this.
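
For reference, the 0-length-file workaround would amount to giving the dataset
an explicit done-flag and making every producer (or some fallback job) touch an
empty marker whenever a slot has no data - e.g. with hadoop fs -touchz. A
sketch, assuming a _SUCCESS marker file:

    <dataset name="logs" frequency="${coord:minutes(1)}"
        initial-instance="${initialDataset}" timezone="America/Los_Angeles">
        <uri-template>${rootPath}/${YEAR}/${MONTH}/${DAY}/${HOUR}/${MINUTE}</uri-template>
        <!-- The dependency is satisfied once _SUCCESS exists in the directory,
             so an empty minute can be marked with:
             hadoop fs -touchz <dir>/_SUCCESS -->
        <done-flag>_SUCCESS</done-flag>
    </dataset>

That works, but it pushes the gap-handling burden onto every producer, which
is what I'd like to avoid.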

I think I have a typical use case: logs are being collected, and batch loader
jobs need to run fairly frequently (say every 5 minutes). However, a hiccup in
the network, or slow times at night, can leave gaps - or even at startup time
there may be logs that begin somewhere between the cutoffs.
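
As a possible partial workaround, coord:latest() might come close here - if I
understand the docs correctly, it resolves against instances that actually
exist and skips gaps. The caveat is that latest() is evaluated when the action
runs rather than against the nominal window, so the semantics aren't quite the
same:

    <data-in name="input" dataset="logs">
        <!-- the 5 most recent instances that actually exist; gaps are skipped -->
        <start-instance>${coord:latest(-4)}</start-instance>
        <end-instance>${coord:latest(0)}</end-instance>
    </data-in>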




On May 8, 2013, at 5:09 PM, Mohammad Islam <[email protected]> wrote:

> Hi,
> Oozie currently supports data triggers only when ALL dependent datasets are
> available. There is no way to specify that M out of N being available is
> enough.
> 
> This could be a new feature. In that case, can you propose a generalized way
> of defining it?
> 
> As a possible workaround, the missing data producer could create a file of
> length 0.
> 
> Regards,
> Mohammad
> 
> ________________________________
> From: lance <[email protected]>
> To: [email protected] 
> Sent: Wednesday, May 8, 2013 4:20 PM
> Subject: Gaps in dataset prevents dependency 
> 
> 
> I would like a more expressive way to define the input-event and the data-in.
> 
> I'm trying to find ways to handle the case where a start/end instance isn't
> satisfied, even though there is data to be processed.
> An example of this is when I'm parsing a set of 24 hours of logs, and there
> may be an hour at night in which nothing is produced. This use case is
> exacerbated when we are talking minutes and doing hourly rollups - but it is
> the same scenario.
> 
> Here is the example config:
> 
> The coordinator runs every 5 minutes:
>     <coordinator-app name="cef-workflow-coordinator"
>         frequency="${coord:minutes(5)}" start="${jobStart}" end="${jobEnd}"
> 
> In this case the input dataset is produced in minutes:
>     <dataset name="logs" frequency="${coord:minutes(1)}"
>         initial-instance="${initialDataset}" timezone="America/Los_Angeles">
>         <uri-template>${rootPath}/${YEAR}/${MONTH}/${DAY}/${HOUR}/${MINUTE}</uri-template>
>         <done-flag></done-flag>
>     </dataset>
> 
> Following the Twitter/Flume CDH example, the actual indicator that the job
> should be executed is the creation of a new directory that is in the next
> set of data:
>     <input-events>
>         <data-in name="input" dataset="logs">
>             <start-instance>${coord:current((coord:tzOffset())-5)}</start-instance>
>             <end-instance>${coord:current(coord:tzOffset())}</end-instance>
>             <!-- <instance>${coord:current(coord:tzOffset())}</instance> -->
>         </data-in>
>         <data-in name="readyIndicator" dataset="logs">
>             <instance>${coord:current(1 + (coord:tzOffset()))}</instance>
>         </data-in>
>     </input-events>
> 
> What I would like is for the trigger to be the creation of a directory that
> is in the future relative to this dataset, and then for the job to take
> whatever is available in the last 5 dataset instances (minutes).
> 
> Instead, if there is a gap in the previous 5 minutes (say no logs came in 
> during minute t-3) then the dependency is never fulfilled.
> 
> 
> Thanks for the help
