Hi Paul,

You are correct about the data dependency and about providing multiple data sets as input events. Maybe I am overthinking the cases where I have to rerun jobs and how to handle them.

Although the job will satisfy the data dependency condition, which POI to process in which hour will still be dynamic, based on the closing time of that POI. So if something goes wrong in the previous jobs of the pipeline (consider delayed data), then we have to rerun for all the POIs again. That is where I am not comfortable.

I will think about it more and post a solution if I find one.

Thanks,
Regards,


On 07/25/2014 08:28 AM, Paul Han wrote:
Hi, Harshal
Based on the docs, a coordinator job instance materialized each hour can take
multiple datasets as input events, e.g.:
<coordinator-app name="..." frequency="60" start="..." end="..." timezone="UTC">
<datasets>
   <dataset name="logs" />
   <dataset name="siteAccessStats"/>
</datasets>
..
<input-events>
    <data-in name="input1" ... />
    <data-in name="input2" ... />
...
</input-events>
<action>
         <workflow>
           <app-path />
           <configuration>
             <property>
               <name>wfInput1</name>
               <value>${coord:dataIn('input1')}</value>
             </property>
             <property>
               <name>wfInput2</name>
               <value>${coord:dataIn('input2')}</value>
             </property>
...
In the action workflow, you could potentially check each dataset's
readiness and handle accordingly. I haven't done such a complicated case
in my own project yet, so please feel free to let me know if I'm wrong.
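For example, a decision node along these lines might work (just an untested
sketch; fs:exists() is a standard workflow EL function, and the node names
are made up):

<decision name="check-inputs">
    <switch>
        <!-- wfInput1/wfInput2 are the dataset paths passed in from the
             coordinator configuration shown above -->
        <case to="process-all">${fs:exists(wfInput1) and fs:exists(wfInput2)}</case>
        <case to="process-logs-only">${fs:exists(wfInput1)}</case>
        <default to="skip-run"/>
    </switch>
</decision>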

-paul



On Tue, Jul 22, 2014 at 9:50 PM, Harshal Vora <[email protected]> wrote:

Hi,

Thanks for the reply, Paul.
Although this solution works, the issue is that you end up writing your
own data dependency logic (i.e. checking that the data sets required for
each POI are processed, based on the time zone of the POI).
Data dependency is one of the major features Oozie provides compared to
other schedulers.

Any thoughts on this?

Regards,


On 07/20/2014 04:27 AM, Paul Han wrote:

You could schedule the coordinator job to "wake up" every hour, or whatever
interval (>= 5 mins?), to process the POIs which are ready.
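Roughly something like this (just a sketch, names and dates made up; the
workflow itself would then decide which POIs are due in that hour):

<coordinator-app name="poi-hourly" frequency="60"
                 start="2014-07-20T00:00Z" end="2015-07-20T00:00Z"
                 timezone="UTC" xmlns="uri:oozie:coordinator:0.2">
  <action>
    <workflow>
      <app-path>${appPath}</app-path>
      <configuration>
        <property>
          <!-- pass the nominal time so the workflow can pick the POIs
               whose closing time falls in this hour -->
          <name>runHour</name>
          <value>${coord:nominalTime()}</value>
        </property>
      </configuration>
    </workflow>
  </action>
</coordinator-app>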

That said, I think this is one of the limits of Oozie. It's not obvious
how to deal with variable sets of data out of the box.

I would process it with one "master" workflow, probably with the help of a
java action, as sketched below.
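Very roughly, and untested on my side (the class name is just a placeholder):

<workflow-app name="poi-master" xmlns="uri:oozie:workflow:0.4">
  <start to="select-pois"/>
  <!-- the java action decides which POIs have closed for this run and
       exposes the list to the downstream actions via capture-output -->
  <action name="select-pois">
    <java>
      <job-tracker>${jobTracker}</job-tracker>
      <name-node>${nameNode}</name-node>
      <main-class>com.example.SelectReadyPois</main-class>
      <arg>${runHour}</arg>
      <capture-output/>
    </java>
    <ok to="process-pois"/>
    <error to="fail"/>
  </action>
  ... (process-pois action, kill and end nodes)
</workflow-app>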

Thanks,
Paul

  On Jul 18, 2014, at 22:41, Harshal Vora <[email protected]> wrote:
Hi,

Any ideas on this?

Regards,

  On 07/17/2014 10:37 AM, Harshal Vora wrote:
Hi,

We are in a situation where we want to crunch data on a daily basis for
a set of Points of Interest (POIs).
The issue is that these POIs have different opening times and, even worse,
different closing times; some even go beyond midnight.

Also, they are in different time zones.
Clearly, one coordinator job that runs at midnight will not satisfy the
requirement.
Nor is it feasible to submit and maintain a separate coordinator job for
each POI.

Is there a better way to tackle this?

Thanks,
Regards,

