In the case I described, the dataset would have an hourly instance. From the
information you provided, that does not apply to you.
I think you will want to do the following:
- set your coordinator to use a daily frequency, i.e.
frequency="${coord:days(1)}"
- set your dataset to use a daily frequency, i.e.
frequency="${coord:days(1)}"
- set your input instance, i.e. <instance>${coord:current(0)}</instance>
If you schedule this job to run daily at 00:00, the job will wait (up to your
timeout value) until the done-flag (up_to_eod_iters_SUCCESS) is written
before processing data.
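On days(1) versus hours(24): as I read the coordinator spec, coord:hours(24)
is always 1440 minutes, while coord:days(1) follows the calendar day in the
coordinator's timezone, so the two differ on the days the clocks change. Here
is a rough Python sketch of that arithmetic. It is not Oozie code, the
US/Pacific offsets and the 2015 fall-back instant are hard-coded assumptions,
and the 7pm start is just taken from your description.

```python
from datetime import datetime, timedelta, timezone

# Assumption: US/Pacific left PDT (UTC-7) for PST (UTC-8) at 2015-11-01 09:00 UTC.
DST_END_UTC = datetime(2015, 11, 1, 9, 0, tzinfo=timezone.utc)
PDT = timezone(timedelta(hours=-7), "PDT")
PST = timezone(timedelta(hours=-8), "PST")


def to_pacific(instant):
    """Pacific wall-clock time for an instant (only valid around Nov 2015)."""
    return instant.astimezone(PDT if instant < DST_END_UTC else PST)


# Nominal time of the action on the evening before the clocks change: 7pm PDT.
start = datetime(2015, 10, 31, 19, 0, tzinfo=PDT)

# Fixed 24-hour step, which is how an hours(24) frequency advances.
after_24h = to_pacific(start + timedelta(hours=24))

# Calendar-day step: 7pm on the next day, which is already in PST.
next_day = datetime(2015, 11, 1, 19, 0, tzinfo=PST)
minutes_in_calendar_day = int((next_day - start).total_seconds() // 60)

print(after_24h.strftime("%Y-%m-%d %H:%M %Z"))
print(minutes_in_calendar_day)
```

The fixed step lands an hour earlier on the wall clock, and the calendar day
spanning the change is 1500 minutes rather than 1440.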
Here's an example of one of my jobs that is very similar to what you are
trying to accomplish:
<coordinator-app name="some-daily-app:dc=${dcNumber}:region=${forRegion}"
frequency="${coord:days(1)}" start="${start}" end="${end}" timezone="UTC"
xmlns="uri:oozie:coordinator:0.4">
<controls>
<timeout>${timeOut}</timeout>
<concurrency>${concurrency}</concurrency>
<execution>${exeOrder}</execution>
<throttle>${throttle}</throttle>
</controls>
<datasets>
<dataset name="logs" frequency="${coord:days(1)}"
initial-instance="${initInstance}" timezone="UTC">
<uri-template>${inputDir}/${YEAR}${MONTH}${DAY}</uri-template>
<done-flag>_READY</done-flag>
</dataset>
<dataset name="dout" frequency="${coord:days(1)}"
initial-instance="${initInstance}" timezone="UTC">
<uri-template>${outputDir}/dt=${YEAR}-${MONTH}-${DAY}</uri-template>
</dataset>
</datasets>
<input-events>
<data-in name="input" dataset="logs">
<instance>${coord:current(0)}</instance>
</data-in>
</input-events>
<output-events>
<data-out name="output" dataset="dout">
<instance>${coord:current(0)}</instance>
</data-out>
</output-events>
On Fri, Nov 6, 2015 at 12:31 PM, Alvin Chyan <[email protected]> wrote:
> Thanks, V.
>
> There's only one done flag per day, so what's the benefit of using
> start/end-instances?
>
> Also, from the doc:
> The ${coord:current(int offset)} EL function resolves to coordinator
> action creation time minus the specified offset multiplied by the dataset
> frequency. This EL function is properly defined in a subsequent section.
>
> It sounds like current(23) will end up being 23*(24 hours) from now, when
> you set dataset freq to hours(24). Does days(1) actually differ from
> hours(24)?
>
> Admittedly I am pretty confused by the coordinator config, but my
> coordinator is basically just copied and pasted from the
> https://oozie.apache.org/docs/3.1.3-incubating/CoordinatorFunctionalSpec.html
> example for a daily input dataset.
>
> Thanks!
> Alvin
>
>
>
> On 11/5/15, 5:01 AM, "Vincent Peplinski" <[email protected]> wrote:
>
> >Hi Alvin,
> >
> >In cases like this I would set the coordinator frequency coord:days(1)
> >and the dataset to coord:hours(24).
> >
> >My input-events would be set as follows:
> ><start-instance>${coord:current(00)}</start-instance>
> ><end-instance>${coord:current(23)}</end-instance>
> >
> >This will result in checking for the done-flag in each hour of the day.
> >
> >I would then schedule the job kickoff at 00:00 every day.
> >
> >V.
> >
> > Original Message
> >From: Alvin Chyan
> >Sent: Wednesday, November 4, 2015 1:54 PM
> >To: [email protected]
> >Reply To: [email protected]
> >Subject: oozie coordinator not waiting for dataset after daylight savings
> >
> >Hi all,
> >Did anyone else experience some bizarre issues with oozie's coordinator
> >after daylight savings time change? Our coordinator was submitted weeks
> >ago at 7pm and scheduled to run every 24 hours. The coordinator is
> >supposed to wait for an input dataset though, so it normally waits until
> >about midnight before the workflow is materialized. However, ever since
> >daylight savings on 11/1, the coordinator would no longer wait and just
> >materialize a workflow instance immediately at 7pm.
> >
> >Here's a part of our coordinator definition:
> ><coordinator-app xmlns="uri:oozie:coordinator:0.2" name="merge"
> >start="${coord:conf('schedule.start')}"
> >end="${coord:conf('schedule.end')}"
> >timezone="US/Pacific"
> >frequency="${coord:hours(24)}">
> ><controls>
> ><timeout>-1</timeout>
> ><concurrency>1</concurrency>
> ></controls>
> ><datasets>
> ><dataset name="all-iters-complete" frequency="${coord:days(1)}"
> >initial-instance="${coord:conf('start')}"
> >timezone="US/Pacific">
> ><uri-template>${coord:conf('namenode')}/process_info/${YEAR}_${MONTH}_${DAY}</uri-template>
> ><done-flag>up_to_eod_iters_SUCCESS</done-flag>
> ></dataset>
> ></datasets>
> >
> ><input-events>
> ><data-in name="input" dataset="all-iters-complete">
> ><instance>${coord:current(0)}</instance>
> ></data-in>
> ></input-events>
> >...
> >
> >
> >The dataset /process_info/2015_11_01/up_to_eod_iters_SUCCESS gets created
> >early on 2015_11_02, but the workflow kicked off before then.
> >
> >One configuration we had that might affect this was in our oozie-site.xml:
> ><property>
> ><name>oozie.processing.timezone</name>
> ><value>GMT-0800</value>
> ></property>
> >
> >
> >Thanks!
> >Alvin
>
>