I'd like a more expressive way to define the input-event and its data-in.
I'm trying to find a way to handle the case where a start/end instance range
isn't satisfied, even though there is data to be processed.
An example of this is parsing a set of 24 hours of logs, where there may be an
hour overnight in which nothing was produced. The problem is exacerbated when
the datasets are minutes and we're doing hourly rollups, but it's the same
scenario.
Here is the example config:
The coordinator runs every 5 minutes:
<coordinator-app name="cef-workflow-coordinator"
frequency="${coord:minutes(5)}" start="${jobStart}" end="${jobEnd}"
In this case the input dataset is produced in minutes:
<dataset name="logs" frequency="${coord:minutes(1)}"
initial-instance="${initialDataset}" timezone="America/Los_Angeles">
<uri-template>${rootPath}/${YEAR}/${MONTH}/${DAY}/${HOUR}/${MINUTE}</uri-template>
<done-flag></done-flag>
</dataset>
Following the twitter/flume CDH example, the actual trigger for running the
job is the creation of a new directory belonging to the next set of data:
<input-events>
<data-in name="input" dataset="logs">
<start-instance>${coord:current((coord:tzOffset())-5)}</start-instance>
<end-instance>${coord:current(coord:tzOffset())}</end-instance>
<!-- <instance>${coord:current(coord:tzOffset())}</instance> -->
</data-in>
<data-in name="readyIndicator" dataset="logs">
<instance>${coord:current(1 + (coord:tzOffset()))}</instance>
</data-in>
</input-events>
What I want is this: when a directory is created that is in the future
relative to this dataset (the trigger), run the job and take whatever is
available from the last 5 dataset instances (minutes).
Instead, if there is a gap anywhere in the previous 5 minutes (say no logs
arrived during minute t-3), the dependency is never satisfied and the action
never runs.
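One thing I've looked at (but haven't verified against the trigger setup
above) is coord:latest(), which per the Oozie docs resolves only to dataset
instances that actually exist on disk, skipping gaps instead of blocking:

```xml
<input-events>
    <data-in name="input" dataset="logs">
        <!-- latest(0) is the newest instance that actually exists;
             latest(-4) is the 5th newest existing instance, so a missing
             minute is skipped rather than blocking the dependency -->
        <start-instance>${coord:latest(-4)}</start-instance>
        <end-instance>${coord:latest(0)}</end-instance>
    </data-in>
</input-events>
```

I'm not sure how (or whether) this combines with the future-directory
readyIndicator trigger, though, so any guidance there would be appreciated.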
Thanks for the help