I don't think that would achieve exactly what we want, since the workflow 
also needs to detect and process the dynamically generated partitions in 
each iteration. We may need to implement this logic some other way, instead 
of using oozie directly.

Anyway, thanks all the same, Rohini!

Thanks,
Huiting

-----Original Message-----
From: Rohini Palaniswamy [mailto:[email protected]] 
Sent: Thursday, December 19, 2013 1:37 AM
To: [email protected]
Subject: Re: Data Pipeline - Does oozie support the newly created partitions 
from step 1 as the input events and parameters for step 2?

You can specify more than one instance in data-out. But if the instances 
produced are random, then the only thing I can think of is passing the 
partitions created by one action to the next in the workflow through action 
output. You can write any data in a java action and pass it on to the next 
action, or you can write it to a file in hdfs and let the other action pick 
it up.

https://cwiki.apache.org/confluence/display/OOZIE/Java+Cookbook - Check out 
capture-output
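
For example, a rough sketch (untested; the class name and the "partitions"
property are placeholders) of a java action main class that publishes the
partitions it found:

    import java.io.File;
    import java.io.FileOutputStream;
    import java.io.OutputStream;
    import java.util.Properties;

    public class CollectPartitions {
        public static void main(String[] args) throws Exception {
            // Discover the partitions step 1 actually produced,
            // e.g. by listing the output directory in HDFS (omitted here).
            String partitions = "/data/dth=2013-12-11-22,/data/dth=2013-12-11-23";

            Properties props = new Properties();
            props.setProperty("partitions", partitions);

            // When the java action is configured with <capture-output/>,
            // Oozie sets this system property to the file it reads the
            // action output back from.
            File out = new File(System.getProperty("oozie.action.output.properties"));
            try (OutputStream os = new FileOutputStream(out)) {
                props.store(os, null);
            }
        }
    }

Add <capture-output/> to that java action in the workflow definition, and 
the next action can read the value with 
${wf:actionData('collect-partitions')['partitions']}, where 
'collect-partitions' is whatever you named the action node.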

Regards,
Rohini



On Tue, Dec 17, 2013 at 6:12 PM, Huiting Li <[email protected]> wrote:

> It's said that coord:dataOut() resolves to all the URIs for the dataset 
> instance specified in an output-event dataset section. From my 
> understanding, the output event is a kind of pre-determined value, since 
> coord:current(0) is usually used in the output event. Taking the oozie 
> doc example below, for the first run coord:dataOut('outputLogs') will 
> resolve to "hdfs://bar:8020/app/daily-logs/2009/01/02", instead of the 
> actual output of the previous step, which may be a few random partitions, right?
>
> So how should I specify the output event in my case? Thanks a lot!
>
>
> ====oozie example=====
> <coordinator-app name="app-coord" frequency="${coord:days(1)}"
>                  start="2009-01-01T24:00Z" end="2009-12-31T24:00Z"
>                  timezone="UTC" xmlns="uri:oozie:coordinator:0.1">
>   <datasets>
>     <dataset name="dailyLogs" frequency="${coord:days(1)}"
>              initial-instance="2009-01-01T24:00Z" timezone="UTC">
>       <uri-template>hdfs://bar:8020/app/daily-logs/${YEAR}/${MONTH}/${DAY}</uri-template>
>     </dataset>
>   </datasets>
>   <input-events>...</input-events>
>   <output-events>
>     <data-out name="outputLogs" dataset="dailyLogs">
>       <instance>${coord:current(0)}</instance>
>     </data-out>
>   </output-events>
>   <action>
>     ...
>     <property>
>       <name>wfOutput</name>
>       <value>${coord:dataOut('outputLogs')}</value>
>     </property>
>   </action>
> </coordinator-app>
>
> Thanks,
> Huiting
>
> -----Original Message-----
> From: Rohini Palaniswamy [mailto:[email protected]]
> Sent: Wednesday, December 18, 2013 9:09 AM
> To: [email protected]
> Subject: Re: Data Pipeline - Does oozie support the newly created 
> partitions from step 1 as the input events and parameters for step 2?
>
> The newly generated partitions should be part of data-out. You can 
> pass the partitions using the coord:dataOut() EL function.
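> For example, a minimal sketch (the dataset and property names here are 
> placeholders):
>
>   <data-out name="output" dataset="myDataset">
>     <instance>${coord:current(0)}</instance>
>   </data-out>
>
> and then reference ${coord:dataOut('output')} from a property in the 
> coordinator's action section.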
>
> Regards,
> Rohini
>
>
>
> On Thu, Dec 12, 2013 at 2:12 AM, Huiting Li <[email protected]>
> wrote:
>
> > Hi,
> >
> > In an oozie coordinator, we can use ${coord:current(int n)} to create 
> > a data pipeline with a coordinator application. It's said that 
> > "${coord:current(int n)} represents the nth dataset instance for a 
> > synchronous dataset, relative to the coordinator action creation 
> > (materialization) time. The coordinator action creation 
> > (materialization) time is computed based on the coordinator job start 
> > time and its frequency. The nth dataset instance is computed based on 
> > the dataset's initial-instance datetime, its frequency and the 
> > (current) coordinator action creation (materialization) time."
> > However, our case is this: the coordinator starts at, for example, 
> > 2013-12-12-02, and step 1 outputs multiple partitions, such as 
> > /data/dth=2013-12-11-22, /data/dth=2013-12-11-23, and 
> > /data/dth=2013-12-12-02. We want to process all of these newly 
> > generated partitions in step 2. That means step 2 takes the output of 
> > step 1 as its input and will process the data in the new partitions 
> > one by one. So if we define a dataset like the one below in step 2, 
> > how could we define the input events (in <data-in>) and pass 
> > parameters (in a configuration property) to step 2?
> >           <uri-template>
> >                  hdfs://xxx:8020/data/dth=${YEAR}-${MONTH}-${DAY}-${HOUR}
> >           </uri-template>
> >
> > Does oozie support this kind of pipeline?
> >
> > Thanks,
> > Huiting
> >
>
