It's said that coord:dataOut() resolves to all of the URIs for the dataset
instance specified in the output event dataset section. From my understanding,
the output event is a kind of pre-determined value, since coord:current(0) is
usually used in the output event. Taking the Oozie doc example below, for the
first run coord:dataOut('outputLogs') will resolve to
"hdfs://bar:8020/app/daily-logs/2009/01/02", instead of the actual output of
the last step, which may be a few arbitrary partitions, right?
So how should I specify the output event in my case? Thanks a lot!
====oozie example=====
<coordinator-app name="app-coord" frequency="${coord:days(1)}"
                 start="2009-01-01T24:00Z" end="2009-12-31T24:00Z" timezone="UTC"
                 xmlns="uri:oozie:coordinator:0.1">
    <datasets>
        <dataset name="dailyLogs" frequency="${coord:days(1)}"
                 initial-instance="2009-01-01T24:00Z" timezone="UTC">
            <uri-template>hdfs://bar:8020/app/daily-logs/${YEAR}/${MONTH}/${DAY}</uri-template>
        </dataset>
    </datasets>
    <input-events>... </input-events>
    <output-events>
        <data-out name="outputLogs" dataset="dailyLogs">
            <instance>${coord:current(0)}</instance>
        </data-out>
    </output-events>
    <action>.....
        <property>
            <name>wfOutput</name>
            <value>${coord:dataOut('outputLogs')}</value>
        </property>
    </action>
</coordinator-app>
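For context, here is a rough sketch of the workflow side that would consume
that wfOutput property; the workflow name, pig script, and parameter name are
made up for illustration only:
<workflow-app name="logs-processing-wf" xmlns="uri:oozie:workflow:0.2">
    <start to="process-logs"/>
    <action name="process-logs">
        <pig>
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <script>process_logs.pig</script>
            <!-- wfOutput carries whatever coord:dataOut('outputLogs') resolved to -->
            <param>OUTPUT=${wfOutput}</param>
        </pig>
        <ok to="end"/>
        <error to="fail"/>
    </action>
    <kill name="fail">
        <message>Pig failed, error message[${wf:errorMessage(wf:lastErrorNode())}]</message>
    </kill>
    <end name="end"/>
</workflow-app>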
Thanks,
Huiting
-----Original Message-----
From: Rohini Palaniswamy [mailto:[email protected]]
Sent: December 18, 2013 9:09
To: [email protected]
Subject: Re: Data Pipeline - Does oozie support the newly created partitions
from step 1 as the input events and parameters for step 2?
The newly generated partitions should be part of data-out. You can pass the
partitions using the coord:dataOut() EL function.
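A minimal sketch of what this could look like in step 1's coordinator (names,
paths, and start/end times are illustrative; the hourly uri-template is the
one from the original mail below):
<coordinator-app name="step1-coord" frequency="${coord:hours(1)}"
                 start="2013-12-12T00:00Z" end="2013-12-31T00:00Z" timezone="UTC"
                 xmlns="uri:oozie:coordinator:0.1">
    <datasets>
        <dataset name="hourlyData" frequency="${coord:hours(1)}"
                 initial-instance="2013-12-01T00:00Z" timezone="UTC">
            <uri-template>hdfs://xxx:8020/data/dth=${YEAR}-${MONTH}-${DAY}-${HOUR}</uri-template>
        </dataset>
    </datasets>
    <output-events>
        <data-out name="stepOneOut" dataset="hourlyData">
            <instance>${coord:current(0)}</instance>
        </data-out>
    </output-events>
    <action>
        <workflow>
            <app-path>hdfs://xxx:8020/apps/step1-wf</app-path>
            <configuration>
                <property>
                    <name>stepOneOutputDir</name>
                    <!-- resolves to the URI of the current hourly instance -->
                    <value>${coord:dataOut('stepOneOut')}</value>
                </property>
            </configuration>
        </workflow>
    </action>
</coordinator-app>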
Regards,
Rohini
On Thu, Dec 12, 2013 at 2:12 AM, Huiting Li <[email protected]> wrote:
> Hi,
>
> In the Oozie coordinator, we can use ${coord:current(int n)} to create a
> data pipeline using a coordinator application. It's said that
> "${coord:current(int n)} represents the nth dataset instance for a
> synchronous dataset, relative to the coordinator action creation
> (materialization) time. The coordinator action creation
> (materialization) time is computed based on the coordinator job start time
> and its frequency.
> The nth dataset instance is computed based on the dataset's
> initial-instance datetime, its frequency and the (current) coordinator
> action creation (materialization) time."
> However, our case is: the coordinator starts at, for example, 2013-12-12-02,
> and step 1 outputs multiple partitions, such as /data/dth=2013-12-11-22,
> /data/dth=2013-12-11-23, /data/dth=2013-12-12-02. We want to process all of
> these newly generated partitions in step 2. That means step 2 takes the
> output of step 1 as its input and will process the data in the new
> partitions one by one. So if we define a dataset like the one below in
> step 2, how could we define the input events (in <data-in>) and pass
> parameters (in a configuration property) to step 2?
> <uri-template>
> hdfs://xxx:8020/data/dth=${YEAR}-${MONTH}-${DAY}-${HOUR}
> </uri-template>
>
> Does Oozie support this kind of pipeline?
>
> Thanks,
> Huiting
>
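Regarding the question above about input events: for reference, a rough
sketch of how a fixed range of hourly instances could be declared as input
for a downstream (step 2) coordinator. All names, paths, times, and the
six-instance window are illustrative; coord:dataIn('stepTwoIn') would resolve
to a comma-separated list of the selected instance URIs.
<coordinator-app name="step2-coord" frequency="${coord:hours(6)}"
                 start="2013-12-12T06:00Z" end="2013-12-31T00:00Z" timezone="UTC"
                 xmlns="uri:oozie:coordinator:0.1">
    <datasets>
        <dataset name="hourlyData" frequency="${coord:hours(1)}"
                 initial-instance="2013-12-01T00:00Z" timezone="UTC">
            <uri-template>hdfs://xxx:8020/data/dth=${YEAR}-${MONTH}-${DAY}-${HOUR}</uri-template>
        </dataset>
    </datasets>
    <input-events>
        <data-in name="stepTwoIn" dataset="hourlyData">
            <!-- the last 6 hourly instances, relative to this action's materialization time -->
            <start-instance>${coord:current(-5)}</start-instance>
            <end-instance>${coord:current(0)}</end-instance>
        </data-in>
    </input-events>
    <action>
        <workflow>
            <app-path>hdfs://xxx:8020/apps/step2-wf</app-path>
            <configuration>
                <property>
                    <name>stepTwoInputDirs</name>
                    <!-- comma-separated list of the selected instance URIs -->
                    <value>${coord:dataIn('stepTwoIn')}</value>
                </property>
            </configuration>
        </workflow>
    </action>
</coordinator-app>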