I'd like a more expressive way to define the input-event and its data-in.
I'm trying to find a way to handle the case where a start/end instance range
isn't satisfied, even though there is data to be processed.
An example of this is parsing a set of 24 hours of logs, where there may be an
hour overnight in which nothing was produced. The problem is exacerbated when
the datasets are minutes and we're doing hourly rollups, but it's the same
scenario.
Here is the example config:
The coordinator runs every 5 minutes:
<coordinator-app name="cef-workflow-coordinator"
frequency="${coord:minutes(5)}" start="${jobStart}" end="${jobEnd}"
In this case the input dataset is produced in minutes:
<dataset name="logs" frequency="${coord:minutes(1)}"
initial-instance="${initialDataset}" timezone="America/Los_Angeles">
<uri-template>${rootPath}/${YEAR}/${MONTH}/${DAY}/${HOUR}/${MINUTE}</uri-template>
<done-flag></done-flag>
</dataset>
Following the twitter/flume CDH example, the actual trigger for running the
job is the creation of a new directory belonging to the next set of data:
<input-events>
<data-in name="input" dataset="logs">
<start-instance>${coord:current((coord:tzOffset())-5)}</start-instance>
<end-instance>${coord:current(coord:tzOffset())}</end-instance>
<!-- <instance>${coord:current(coord:tzOffset())}</instance> -->
</data-in>
<data-in name="readyIndicator" dataset="logs">
<instance>${coord:current(1 + (coord:tzOffset()))}</instance>
</data-in>
</input-events>
What I want is this: when a directory is created that is in the future
relative to this dataset (the trigger), run the job and take whatever is
available from the last 5 dataset instances (minutes).
Instead, if there is a gap anywhere in the previous 5 minutes (say no logs
arrived during minute t-3), the dependency is never satisfied and the action
never runs.
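One thing I've looked at (but haven't verified against the trigger setup
above) is coord:latest(), which per the Oozie docs resolves only to dataset
instances that actually exist on disk, skipping gaps instead of blocking:

```xml
<input-events>
    <data-in name="input" dataset="logs">
        <!-- latest(0) is the newest instance that actually exists;
             latest(-4) is the 5th newest existing instance, so a missing
             minute is skipped rather than blocking the dependency -->
        <start-instance>${coord:latest(-4)}</start-instance>
        <end-instance>${coord:latest(0)}</end-instance>
    </data-in>
</input-events>
```

I'm not sure how (or whether) this combines with the future-directory
readyIndicator trigger, though, so any guidance there would be appreciated.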
Thanks for the help