It's said that coord:dataOut() resolves to all of the URIs for the dataset
instance specified in the output event dataset section. From my understanding,
the output event is a kind of pre-determined value, since coord:current(0) is
usually used in the output event. Taking the Oozie doc example below, for the
first run coord:dataOut('outputLogs') will resolve to
"hdfs://bar:8020/app/daily-logs/2009/01/02", instead of the actual output of
the last step, which may be a few arbitrary partitions, right?
So how should I specify the output event in my case? Thanks a lot!
====oozie example=====
<coordinator-app name="app-coord" frequency="${coord:days(1)}"
                 start="2009-01-01T24:00Z" end="2009-12-31T24:00Z" timezone="UTC"
                 xmlns="uri:oozie:coordinator:0.1">
    <datasets>
        <dataset name="dailyLogs" frequency="${coord:days(1)}"
                 initial-instance="2009-01-01T24:00Z" timezone="UTC">
            <uri-template>hdfs://bar:8020/app/daily-logs/${YEAR}/${MONTH}/${DAY}</uri-template>
        </dataset>
    </datasets>
    <input-events>... </input-events>
    <output-events>
        <data-out name="outputLogs" dataset="dailyLogs">
            <instance>${coord:current(0)}</instance>
        </data-out>
    </output-events>
    <action>.....
        <property>
            <name>wfOutput</name>
            <value>${coord:dataOut('outputLogs')}</value>
        </property>
    </action>
</coordinator-app>
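For context, here is a rough sketch of the workflow side that would consume
that wfOutput property; the workflow name, pig script, and parameter name are
made up for illustration only:
<workflow-app name="logs-processing-wf" xmlns="uri:oozie:workflow:0.2">
    <start to="process-logs"/>
    <action name="process-logs">
        <pig>
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <script>process_logs.pig</script>
            <!-- wfOutput carries whatever coord:dataOut('outputLogs') resolved to -->
            <param>OUTPUT=${wfOutput}</param>
        </pig>
        <ok to="end"/>
        <error to="fail"/>
    </action>
    <kill name="fail">
        <message>Pig failed, error message[${wf:errorMessage(wf:lastErrorNode())}]</message>
    </kill>
    <end name="end"/>
</workflow-app>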
Thanks,
Huiting
-----Original Message-----
From: Rohini Palaniswamy [mailto:[email protected]]
Sent: December 18, 2013 9:09
To: [email protected]
Subject: Re: Data Pipeline - Does oozie support the newly created partitions
from step 1 as the input events and parameters for step 2?
The newly generated partitions should be part of data-out. You can pass the
partitions using the coord:dataOut() EL function.
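A minimal sketch of what this could look like in step 1's coordinator (names,
paths, and start/end times are illustrative; the hourly uri-template is the
one from the original mail below):
<coordinator-app name="step1-coord" frequency="${coord:hours(1)}"
                 start="2013-12-12T00:00Z" end="2013-12-31T00:00Z" timezone="UTC"
                 xmlns="uri:oozie:coordinator:0.1">
    <datasets>
        <dataset name="hourlyData" frequency="${coord:hours(1)}"
                 initial-instance="2013-12-01T00:00Z" timezone="UTC">
            <uri-template>hdfs://xxx:8020/data/dth=${YEAR}-${MONTH}-${DAY}-${HOUR}</uri-template>
        </dataset>
    </datasets>
    <output-events>
        <data-out name="stepOneOut" dataset="hourlyData">
            <instance>${coord:current(0)}</instance>
        </data-out>
    </output-events>
    <action>
        <workflow>
            <app-path>hdfs://xxx:8020/apps/step1-wf</app-path>
            <configuration>
                <property>
                    <name>stepOneOutputDir</name>
                    <!-- resolves to the URI of the current hourly instance -->
                    <value>${coord:dataOut('stepOneOut')}</value>
                </property>
            </configuration>
        </workflow>
    </action>
</coordinator-app>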
Regards,
Rohini
On Thu, Dec 12, 2013 at 2:12 AM, Huiting Li <[email protected]> wrote:
> Hi,
>
> In the Oozie coordinator, we can use ${coord:current(int n)} to create a
> data pipeline using a coordinator application. It's said that
> "${coord:current(int n)} represents the nth dataset instance for a
> synchronous dataset, relative to the coordinator action creation
> (materialization) time. The coordinator action creation
> (materialization) time is computed based on the coordinator job start time
> and its frequency.
> The nth dataset instance is computed based on the dataset's
> initial-instance datetime, its frequency and the (current) coordinator
> action creation (materialization) time."
> However, our case is: the coordinator starts at, for example, 2013-12-12-02,
> and step 1 outputs multiple partitions, such as /data/dth=2013-12-11-22,
> /data/dth=2013-12-11-23, /data/dth=2013-12-12-02. We want to process all of
> these newly generated partitions in step 2. That means step 2 takes the
> output of step 1 as its input and will process the data in the new
> partitions one by one. So if we define a dataset like the one below in
> step 2, how could we define the input events (in <data-in>) and pass
> parameters (in a configuration property) to step 2?
> <uri-template>
> hdfs://xxx:8020/data/dth=${YEAR}-${MONTH}-${DAY}-${HOUR}
> </uri-template>
>
> Does Oozie support this kind of pipeline?
>
> Thanks,
> Huiting
>
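Regarding the question above about input events: for reference, a rough
sketch of how a fixed range of hourly instances could be declared as input
for a downstream (step 2) coordinator. All names, paths, times, and the
six-instance window are illustrative; coord:dataIn('stepTwoIn') would resolve
to a comma-separated list of the selected instance URIs.
<coordinator-app name="step2-coord" frequency="${coord:hours(6)}"
                 start="2013-12-12T06:00Z" end="2013-12-31T00:00Z" timezone="UTC"
                 xmlns="uri:oozie:coordinator:0.1">
    <datasets>
        <dataset name="hourlyData" frequency="${coord:hours(1)}"
                 initial-instance="2013-12-01T00:00Z" timezone="UTC">
            <uri-template>hdfs://xxx:8020/data/dth=${YEAR}-${MONTH}-${DAY}-${HOUR}</uri-template>
        </dataset>
    </datasets>
    <input-events>
        <data-in name="stepTwoIn" dataset="hourlyData">
            <!-- the last 6 hourly instances, relative to this action's materialization time -->
            <start-instance>${coord:current(-5)}</start-instance>
            <end-instance>${coord:current(0)}</end-instance>
        </data-in>
    </input-events>
    <action>
        <workflow>
            <app-path>hdfs://xxx:8020/apps/step2-wf</app-path>
            <configuration>
                <property>
                    <name>stepTwoInputDirs</name>
                    <!-- comma-separated list of the selected instance URIs -->
                    <value>${coord:dataIn('stepTwoIn')}</value>
                </property>
            </configuration>
        </workflow>
    </action>
</coordinator-app>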