Makes sense - just seems like a common use case. I saw another thread here: http://mail-archives.apache.org/mod_mbox/oozie-user/201304.mbox/%3CCA%2B97RN0Q7NH0iO7-7ntFM1ZYUP4OpgdO6udiB2C7Nsp5OjnHEQ%40mail.gmail.com%3E
Same issue - seems like there should be native support for this use case?

On May 9, 2013, at 11:39 AM, Mona Chitnis <[email protected]> wrote:

> Hi Lance,
>
> Your use-case is not natively supported in Oozie. Some users with a
> similar architecture have formed a solution of a "dropbox" directory
> where all or some of the incoming data is dumped, and then the Oozie
> coordinator checks for this dropbox as an input-event.
>
> --
> Mona
>
> On 5/9/13 9:29 AM, "lance" <[email protected]> wrote:
>
>> Thanks for the ideas Paul. I'll try some more experiments with just using
>> the coordinator trigger. In the meantime I'm also going to see what else
>> can be done with the non-existent paths. I think writing a java or script
>> action to take care of this is very doable, but also frustrating given
>> the use case, i.e. it seems like Oozie should natively embrace this use
>> case - I keep thinking I'm missing something here.
>>
>> On May 8, 2013, at 5:58 PM, Paul Chavez <[email protected]> wrote:
>>
>>> I think you can implement what you want already by not using
>>> output-events and just having the coordinator trigger on its configured
>>> frequency.
>>>
>>> I have a coordinator that uses an output dataset and output event to
>>> resolve a specific instance. The use case here is a nightly sqoop pull
>>> from our OLTP DB, but it applies here as well: you need a path and it
>>> may not exist. Define an output dataset with the same frequency as
>>> your coordinator, and pass the path in via an output event. Your
>>> workflow will have to gracefully handle the non-existence of the path in
>>> this case, but at least your coordinator will always run on its
>>> interval.
>>>
>>>     <datasets>
>>>         <dataset name="SqoopOut" frequency="${coord:days(1)}"
>>>                  initial-instance="2012-12-01T07:55Z"
>>>                  timezone="America/Los_Angeles">
>>>             <uri-template>${nameNode}/SqoopOutput/Data${YEAR}${MONTH}${DAY}</uri-template>
>>>         </dataset>
>>>     </datasets>
>>>     <output-events>
>>>         <data-out name="SqoopOutPath" dataset="SqoopOut">
>>>             <instance>${coord:current(-1)}</instance>
>>>         </data-out>
>>>     </output-events>
>>>
>>> Later in the workflow node:
>>>     <property>
>>>         <name>SqoopPath</name>
>>>         <value>${coord:dataOut('SqoopOutPath')}</value>
>>>     </property>
>>>
>>> I also use flume and am working on a post-processing coordinator for
>>> various log flows. However, in my case we do not expect any gaps in the
>>> data and want to catch cases where the data has backed up due to an
>>> upstream agent failure. In this case I put a dependency on the
>>> current(0) instance for the input-event so that the coordinator waits
>>> until the previous interval has finished writing. The actual workflow
>>> only takes a formatted date string to use in a Hive query and resolves
>>> it from coord:nominalTime(-1), giving me the previous instance date.
>>> However, this relies on flume delivering events in the order that they
>>> were received to guarantee no data is missed, an assumption that may not
>>> hold depending on your architecture.
>>>
>>> Hope that helps,
>>> Paul Chavez
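(For anyone hitting this thread in the archive later: the "gracefully handle
the non-existence of the path" part Paul mentions could be a decision node at
the top of the workflow, roughly like the untested sketch below. The node
names are made up; it assumes the path arrives as the SqoopPath property the
way Paul's snippet passes it, and uses the standard fs:exists/wf:conf workflow
EL functions.)

    <decision name="check-input-exists">
        <switch>
            <!-- only go to the processing action when the resolved path actually exists -->
            <case to="process-data">${fs:exists(wf:conf('SqoopPath'))}</case>
            <default to="end"/>
        </switch>
    </decision>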
>>>
>>> -----Original Message-----
>>> From: lance [mailto:[email protected]]
>>> Sent: Wednesday, May 08, 2013 5:19 PM
>>> To: [email protected]; Mohammad Islam
>>> Subject: Re: Gaps in dataset prevents dependency
>>>
>>> I think the generalization is to separate the trigger and the data set.
>>> I.e. the trigger can be just how it is today - data files, existence of
>>> directories - but the dataset could be defined separately as all data
>>> within a range (or however it wants to be defined). In this case there
>>> is no strict fulfillment, just whatever data currently exists within
>>> the range, for instance.
>>>
>>> As far as the missing data producer creating a 0-length file - this is
>>> flume - so I would think that this would be a pretty standard use case.
>>> Even the CDH example using twitter data uses flume to pull tweets into
>>> an Oozie workflow - yet it says they had to do a hack in order to
>>> trigger because there was no good way to do it in flume. Would like to
>>> see a better way of doing this.
>>>
>>> I think that I have a typical use case where logs are getting
>>> collected, and batch loader jobs want to be executed fairly frequently
>>> (say every 5 minutes). However, a hiccup in the network, or slow times
>>> at night, might leave gaps - or even at startup time there may be logs
>>> that start somewhere between the cutoffs.
>>>
>>> On May 8, 2013, at 5:09 PM, Mohammad Islam <[email protected]> wrote:
>>>
>>>> Hi,
>>>> Oozie currently supports data-triggers where ALL dependent data-sets
>>>> are available. There is no way of specifying that only M out of N need
>>>> to be available.
>>>>
>>>> This could be a new feature. In that case, can you define a
>>>> generalized way of specifying this?
>>>>
>>>> As a possible workaround, the missing data producer could create a
>>>> file with length 0.
>>>>
>>>> Regards,
>>>> Mohammad
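(Side note on Mohammad's zero-length-file workaround: if the placeholder ends
up being created from Oozie itself rather than by the producer, an fs action
can do it - roughly like the untested sketch below. The node names and the
path are made up, and it assumes an Oozie version whose fs action supports
touchz.)

    <action name="create-placeholder">
        <fs>
            <!-- create the expected minute directory plus an empty flag file
                 so the coordinator dependency can resolve -->
            <mkdir path="${nameNode}/logs/2013/05/08/23/41"/>
            <touchz path="${nameNode}/logs/2013/05/08/23/41/_DONE"/>
        </fs>
        <ok to="end"/>
        <error to="fail"/>
    </action>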
>>>>
>>>> ________________________________
>>>> From: lance <[email protected]>
>>>> To: [email protected]
>>>> Sent: Wednesday, May 8, 2013 4:20 PM
>>>> Subject: Gaps in dataset prevents dependency
>>>>
>>>> Would like a more verbose way to define the input-event - and the
>>>> data-in.
>>>>
>>>> I'm trying to find ways to handle when a start/end instance isn't
>>>> satisfied, even though there is data to be processed.
>>>> An example of this is when I'm parsing a set of 24 hours of logs, and
>>>> there may be an hour at night where nothing is produced. This use case
>>>> is exacerbated when we are talking minutes and doing hourly rollups -
>>>> but it's the same scenario.
>>>>
>>>> Here is the example config:
>>>>
>>>> The coordinator runs every 5 minutes:
>>>>     <coordinator-app name="cef-workflow-coordinator"
>>>>                      frequency="${coord:minutes(5)}" start="${jobStart}" end="${jobEnd}"
>>>>
>>>> In this case the input dataset is produced in minutes:
>>>>     <dataset name="logs" frequency="${coord:minutes(1)}"
>>>>              initial-instance="${initialDataset}"
>>>>              timezone="America/Los_Angeles">
>>>>         <uri-template>${rootPath}/${YEAR}/${MONTH}/${DAY}/${HOUR}/${MINUTE}</uri-template>
>>>>         <done-flag></done-flag>
>>>>     </dataset>
>>>>
>>>> Following the twitter/flume CDH example, the actual indicator that the
>>>> job should be executed is when a new directory is created that is in
>>>> the next set of data:
>>>>     <input-events>
>>>>         <data-in name="input" dataset="logs">
>>>>             <start-instance>${coord:current((coord:tzOffset())-5)}</start-instance>
>>>>             <end-instance>${coord:current(coord:tzOffset())}</end-instance>
>>>>             <!-- <instance>${coord:current(coord:tzOffset())}</instance> -->
>>>>         </data-in>
>>>>         <data-in name="readyIndicator" dataset="logs">
>>>>             <instance>${coord:current(1 + (coord:tzOffset()))}</instance>
>>>>         </data-in>
>>>>     </input-events>
>>>>
>>>> What I'd like is for the trigger to be that a directory got created
>>>> that is in the future relative to this dataset, and then to take
>>>> whatever is available in the last 5 dataset instances (minutes).
>>>>
>>>> Instead, if there is a gap in the previous 5 minutes (say no logs came
>>>> in during minute t-3) then the dependency is never fulfilled.
>>>>
>>>> Thanks for the help
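(One more sketch for anyone landing here from the archives: the time-only
trigger Paul described, where the coordinator just fires on its frequency
with no input-events and the workflow takes a formatted date string resolved
from the nominal time, deciding for itself what to do with whichever minute
directories exist. Untested; the workflow path and property name are made up,
and it leans on the standard coord:nominalTime()/coord:formatTime() EL
functions.)

    <coordinator-app name="cef-workflow-coordinator" frequency="${coord:minutes(5)}"
                     start="${jobStart}" end="${jobEnd}" timezone="America/Los_Angeles"
                     xmlns="uri:oozie:coordinator:0.2">
        <action>
            <workflow>
                <app-path>${workflowAppPath}</app-path>
                <configuration>
                    <!-- hand the nominal time to the workflow; it can glob the
                         last 5 minute directories itself and skip the missing ones -->
                    <property>
                        <name>batchTime</name>
                        <value>${coord:formatTime(coord:nominalTime(), 'yyyy/MM/dd/HH/mm')}</value>
                    </property>
                </configuration>
            </workflow>
        </action>
    </coordinator-app>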
