Agree, you have a point there. As Mohammad suggested in a similar thread, we are looking towards adding support for this in the future, as well as Asynchronous Datasets (not timestamp-bound).

Thanks,
Mona

On 5/9/13 11:55 AM, "lance" <[email protected]> wrote:

>Makes sense - just seems like a common use case. I saw another thread
>here:
>http://mail-archives.apache.org/mod_mbox/oozie-user/201304.mbox/%3CCA%2B97RN0Q7NH0iO7-7ntFM1ZYUP4OpgdO6udiB2C7Nsp5OjnHEQ%40mail.gmail.com%3E
>
>Same issue -
>
>Seems like there should be native support for this use case?
>
>On May 9, 2013, at 11:39 AM, Mona Chitnis <[email protected]> wrote:
>
>> Hi Lance,
>>
>> Your use-case is not natively supported in Oozie. Some users with a
>> similar architecture have formed a solution of a "dropbox" directory
>> where all or some of the incoming data is dumped, and then the Oozie
>> coordinator checks this dropbox as the input-event.
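>>
>> A rough sketch of that pattern (untested; the names, frequency and
>> paths here are illustrative, not from a real deployment): producers
>> dump files into a per-interval dropbox directory and touch a flag file
>> when they finish, and the coordinator depends on the flag.
>>
>> <dataset name="dropbox" frequency="${coord:minutes(5)}"
>>          initial-instance="2013-01-01T00:00Z" timezone="UTC">
>>     <!-- illustrative layout: one dropbox directory per 5-minute interval -->
>>     <uri-template>${nameNode}/dropbox/${YEAR}${MONTH}${DAY}${HOUR}${MINUTE}</uri-template>
>>     <!-- producers touch this zero-length file when they finish writing -->
>>     <done-flag>_READY</done-flag>
>> </dataset>
>>
>> <input-events>
>>     <data-in name="dropboxIn" dataset="dropbox">
>>         <instance>${coord:current(0)}</instance>
>>     </data-in>
>> </input-events>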
>>
>> --
>> Mona
>>
>> On 5/9/13 9:29 AM, "lance" <[email protected]> wrote:
>>
>>> Thanks for the ideas Paul. I'll try some more experiments with just
>>> using the coordinator trigger. In the meantime I'm also going to see
>>> what else can be done with the non-existent paths. I think writing a
>>> java or script action to take care of this is very doable, but also
>>> frustrating given the use case, i.e. it seems like Oozie should
>>> natively embrace this use case - I keep thinking I'm missing
>>> something here.
>>>
>>> On May 8, 2013, at 5:58 PM, Paul Chavez <[email protected]>
>>> wrote:
>>>
>>>> I think you can implement what you want already by not using
>>>> output-events and just having a coordinator trigger on its configured
>>>> frequency.
>>>>
>>>> I have a coordinator that uses an output dataset and output event to
>>>> resolve a specific instance. The use case here is a nightly sqoop pull
>>>> from our OLTP DB, but it applies here as well: you need a path and it
>>>> may not exist. Define an output dataset with the same frequency as
>>>> your coordinator, and pass the path in via an output event. Your
>>>> workflow will have to gracefully handle the non-existence of the path
>>>> in this case (see the decision-node sketch further down), but at least
>>>> your coordinator will always run on its interval.
>>>>
>>>> <datasets>
>>>>     <dataset name="SqoopOut" frequency="${coord:days(1)}"
>>>>              initial-instance="2012-12-01T07:55Z" timezone="America/Los_Angeles">
>>>>         <uri-template>${nameNode}/SqoopOutput/Data${YEAR}${MONTH}${DAY}</uri-template>
>>>>     </dataset>
>>>> </datasets>
>>>> <output-events>
>>>>     <data-out name="SqoopOutPath" dataset="SqoopOut">
>>>>         <instance>${coord:current(-1)}</instance>
>>>>     </data-out>
>>>> </output-events>
>>>>
>>>> Later in the workflow node:
>>>>
>>>> <property>
>>>>     <name>SqoopPath</name>
>>>>     <value>${coord:dataOut('SqoopOutPath')}</value>
>>>> </property>
>>>>
>>>> I also use flume and am working on a post-processing coordinator for
>>>> various log flows. However, in my case we do not expect any gaps in
>>>> the data and want to catch cases where the data has backed up due to
>>>> an upstream agent failure. In this case I put a dependency on the
>>>> current(0) instance for the input-event so that the coordinator waits
>>>> until the previous interval has finished writing. The actual workflow
>>>> only takes a formatted date string to use in a Hive query and resolves
>>>> it from coord:nominalTime(-1), giving me the previous instance date.
>>>> However, this relies on flume delivering events in the order they were
>>>> received to guarantee no data is missed, an assumption that may not
>>>> hold depending on your architecture.
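>>>>
>>>> That dependency looks roughly like this (sketch only - the dataset
>>>> name is illustrative, not from my actual config):
>>>>
>>>> <input-events>
>>>>     <!-- current(0) only materializes once flume has rolled over into
>>>>          the current interval, i.e. the previous interval is complete -->
>>>>     <data-in name="flumeReady" dataset="FlumeLogs">
>>>>         <instance>${coord:current(0)}</instance>
>>>>     </data-in>
>>>> </input-events>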
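>>>>
>>>> And for gracefully handling the missing path mentioned above,
>>>> something like this decision node should work (untested sketch;
>>>> fs:exists and wf:conf are standard Oozie workflow EL functions, the
>>>> node names are illustrative):
>>>>
>>>> <decision name="check-input">
>>>>     <switch>
>>>>         <!-- only process when the path the coordinator handed us exists -->
>>>>         <case to="process-data">${fs:exists(wf:conf('SqoopPath'))}</case>
>>>>         <default to="end"/>
>>>>     </switch>
>>>> </decision>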
>>>>
>>>> Hope that helps,
>>>> Paul Chavez
>>>>
>>>> -----Original Message-----
>>>> From: lance [mailto:[email protected]]
>>>> Sent: Wednesday, May 08, 2013 5:19 PM
>>>> To: [email protected]; Mohammad Islam
>>>> Subject: Re: Gaps in dataset prevents dependency
>>>>
>>>> I think the generalization is to separate the trigger and the dataset.
>>>>
>>>> I.e. the trigger can be just how it is today - data files, existence
>>>> of directories - but the dataset could be defined separately as all
>>>> data within a range (or however it wants to be defined). In this case
>>>> there is no strict fulfillment, just whatever data currently exists
>>>> within the range, for instance.
>>>>
>>>> As far as the missing data producer creating a 0-length file - this is
>>>> flume - so I would think that this would be a pretty standard use
>>>> case. Even the CDH example using twitter data uses flume to pull in
>>>> tweets to an oozie workflow - yet it says they had to do a hack in
>>>> order to trigger because there was no good way in flume. Would like to
>>>> see a better way of doing this.
>>>>
>>>> I think that I have a typical use case where logs are getting
>>>> collected, and batch loader jobs want to be executed fairly frequently
>>>> (say every 5 minutes). However, a hiccup in the network, or slow times
>>>> at night, might leave gaps - or even at startup time there may be logs
>>>> that start somewhere between the cutoffs.
>>>>
>>>> On May 8, 2013, at 5:09 PM, Mohammad Islam <[email protected]> wrote:
>>>>
>>>>> Hi,
>>>>> Oozie currently supports data-triggers where ALL dependent datasets
>>>>> are available. There is no way of specifying that M out of N are
>>>>> available.
>>>>>
>>>>> This could be a new feature. In that case, can you define a
>>>>> generalized way of defining this?
>>>>>
>>>>> As a possible workaround, the missing data producer could create a
>>>>> file with length 0.
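>>>>>
>>>>> For example, with a done-flag on the dataset, the producer can touch
>>>>> an empty flag file for every interval whether or not data arrived
>>>>> (sketch only; the flag name is arbitrary):
>>>>>
>>>>> <dataset name="logs" frequency="${coord:minutes(1)}"
>>>>>          initial-instance="${initialDataset}" timezone="America/Los_Angeles">
>>>>>     <uri-template>${rootPath}/${YEAR}/${MONTH}/${DAY}/${HOUR}/${MINUTE}</uri-template>
>>>>>     <!-- producer runs e.g. "hadoop fs -touchz <dir>/_DONE" even for
>>>>>          empty intervals, so the dependency is always satisfiable -->
>>>>>     <done-flag>_DONE</done-flag>
>>>>> </dataset>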
>>>>>
>>>>> Regards,
>>>>> Mohammad
>>>>>
>>>>> ________________________________
>>>>> From: lance <[email protected]>
>>>>> To: [email protected]
>>>>> Sent: Wednesday, May 8, 2013 4:20 PM
>>>>> Subject: Gaps in dataset prevents dependency
>>>>>
>>>>> Would like a more verbose way to define the input-event - and the
>>>>> data-in.
>>>>>
>>>>> I'm trying to find ways to handle when a start/end instance isn't
>>>>> satisfied, even though there is data to be processed.
>>>>> An example of this is when I'm parsing a set of 24 hours of logs, and
>>>>> there may be an hour at night that doesn't have anything produced.
>>>>> This use case is exacerbated when we are talking minutes and doing
>>>>> hourly rollups - but same scenario.
>>>>>
>>>>> Here is the example config:
>>>>>
>>>>> The coordinator runs every 5 minutes:
>>>>>
>>>>> <coordinator-app name="cef-workflow-coordinator"
>>>>>                  frequency="${coord:minutes(5)}"
>>>>>                  start="${jobStart}" end="${jobEnd}">
>>>>>
>>>>> In this case the input dataset is produced in minutes:
>>>>>
>>>>> <dataset name="logs" frequency="${coord:minutes(1)}"
>>>>>          initial-instance="${initialDataset}" timezone="America/Los_Angeles">
>>>>>     <uri-template>${rootPath}/${YEAR}/${MONTH}/${DAY}/${HOUR}/${MINUTE}</uri-template>
>>>>>     <done-flag></done-flag>
>>>>> </dataset>
>>>>>
>>>>> Following the twitter/flume CDH example, the actual indicator that
>>>>> the job should be executed is when a new directory is created that is
>>>>> in the next set of data:
>>>>>
>>>>> <input-events>
>>>>>     <data-in name="input" dataset="logs">
>>>>>         <start-instance>${coord:current((coord:tzOffset())-5)}</start-instance>
>>>>>         <end-instance>${coord:current(coord:tzOffset())}</end-instance>
>>>>>         <!-- <instance>${coord:current(coord:tzOffset())}</instance> -->
>>>>>     </data-in>
>>>>>     <data-in name="readyIndicator" dataset="logs">
>>>>>         <instance>${coord:current(1 + (coord:tzOffset()))}</instance>
>>>>>     </data-in>
>>>>> </input-events>
>>>>>
>>>>> What I'd like is for a directory created in the future relative to
>>>>> this dataset to be the trigger, and then to take whatever is
>>>>> available in the last 5 dataset instances (minutes).
>>>>>
>>>>> Instead, if there is a gap in the previous 5 minutes (say no logs
>>>>> came in during minute t-3) then the dependency is never fulfilled.
>>>>>
>>>>> Thanks for the help
>>>>
>>>
>>
>
