Re: Gaps in dataset prevents dependency

Mohammad Islam Wed, 08 May 2013 17:09:55 -0700

Hi,
Oozie currently supports data-trigger where ALL dependent data-sets are 
available. There is no way of specifying that if M out of N are available.


This could be a new feature. In that case, can you define a generalized way of 
defining this?

As a possible work around, missing data producer could create a file with 
length 0.

Regards,
Mohammad




 


________________________________
 From: lance <[email protected]>
To: [email protected] 
Sent: Wednesday, May 8, 2013 4:20 PM
Subject: Gaps in dataset prevents dependency 
 

Would like a more verbose way to define the input-event - and the data-in.

I'm trying to find ways to handle when a start/end instance isn't satisfied, 
event thought there is data to be processed.
An example of this is when I'm parsing a set of 24 hours of logs, and there may 
be an hour at night that doesn't have anything produced. This use case is 
exacerbated when we are talking minutes and doing hourly rollups - but same 
scenario. 

Here is the example config:

The coordinator runs every 5 minutes:
<coordinator-app name="cef-workflow-coordinator" 
frequency="${coord:minutes(5)}" start="${jobStart}" end="${jobEnd}"

In this case the input dataset is produced in minutes:
        <dataset name="logs" frequency="${coord:minutes(1)}"
            initial-instance="${initialDataset}" timezone="America/Los_Angeles">
            
<uri-template>${rootPath}/${YEAR}/${MONTH}/${DAY}/${HOUR}/${MINUTE}</uri-template>
            <done-flag></done-flag>
        </dataset>

Using the example from the twitter/flume CDH example, the actual indicator that 
the job should be executed is when a new directory is created that is in the 
next set of data:  
    <input-events>
        <data-in name="input" dataset="logs">
            
<start-instance>${coord:current((coord:tzOffset())-5)}</start-instance>
            <end-instance>${coord:current(coord:tzOffset())}</end-instance>
        <!--    <instance>${coord:current(coord:tzOffset())}</instance> -->
        </data-in>
        <data-in name="readyIndicator" dataset="logs">
            <instance>${coord:current(1 + (coord:tzOffset()))}</instance>
        </data-in>
    </input-events>

Would like it to be that a directory got created that is in the future from 
this dataset (the trigger), and then take whatever is available in the last 5 
datasets (minutes).  

Instead, if there is a gap in the previous 5 minutes (say no logs came in 
during minute t-3) then the dependency is never fulfilled.


Thanks for the help

Re: Gaps in dataset prevents dependency

Reply via email to