Perhaps a different way to word my question:
I'd like to have a job that processes data created one day behind the
current day. That is, if it is the second day of the month, the job
should only process data up to the first day of the month, because the
data lags behind. I am using the DistCp action to back up Parquet files
to S3, and it should only run up to today's date minus one. Every two
hours, I produce a new dataset for the day before. For both the input
and output datasets, I have something like this:
<input-events>
  <data-in name="din" dataset="parquetFiles">
    <instance>${coord:current(-1)}</instance>
  </data-in>
</input-events>
<output-events>
  <data-out name="dout" dataset="parquetFilesS3">
    <instance>${coord:current(-1)}</instance>
  </data-out>
</output-events>
However, when I check the current status of the job, it has already
gone to TIMEDOUT while waiting for the input dependency. How can I make
Oozie acknowledge the one-day lag without timing out? Please let me
know if I can provide more information; I could use a hand.
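
To make the intended lag concrete, the date arithmetic can be sketched
in Python. This is only an approximation of how coord:current(n) is
documented to behave (take the latest dataset instance at or before the
action's nominal time, then shift by n dataset frequencies); it is not
Oozie's code. The function name resolve_current, the daily dataset
frequency, and the 2016-01-01 initial instance are assumptions made for
illustration, and time zones and DST are ignored.

```python
from datetime import datetime, timedelta
from typing import Optional


def resolve_current(nominal_time: datetime,
                    initial_instance: datetime,
                    frequency: timedelta,
                    n: int) -> Optional[datetime]:
    """Approximate coord:current(n) for a fixed-frequency dataset.

    Hypothetical helper: finds the latest dataset instance at or before
    nominal_time, then shifts it by n dataset frequencies. Returns None
    (a choice made for this sketch) when the result would fall before
    the dataset's initial instance.
    """
    if nominal_time < initial_instance:
        return None
    # Number of whole dataset periods elapsed since the initial instance.
    steps = (nominal_time - initial_instance) // frequency
    instance = initial_instance + (steps + n) * frequency
    if instance < initial_instance:
        return None
    return instance


if __name__ == "__main__":
    initial = datetime(2016, 1, 1)   # assumed dataset initial instance
    daily = timedelta(days=1)        # assumed dataset frequency
    # Coordinator actions materialized every two hours on January 5.
    for hour in range(0, 24, 2):
        nominal = datetime(2016, 1, 5, hour)
        print(nominal, "->", resolve_current(nominal, initial, daily, -1))
```

With a daily dataset, every two-hourly action on January 5 resolves to
the January 4 instance, which is the one-day lag described above.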
On Mon, Jan 4, 2016 at 1:59 PM, Steve Hanna <[email protected]> wrote:
> Hello all, I have a question about the behavior of a coordinator
> action when data is unavailable and timeouts occur. I'm using the
> built-in DistCp action to copy the data locally to S3.
>
> Basically, my situation is as follows. I have incoming data laid out
> as year=blah/month=blah/day=blah/hour=blah, and I use the creation of
> a _SUCCESS file as the trigger. This data is created every couple of
> hours by another system. Sometimes the other system doesn't produce
> all the data we expect at the interval we expect, so the jobs time
> out. The data that a timed-out job was waiting for does eventually
> become available at some later time.
>
> My question is about the TIMEOUT. If a coordinator action times out
> because the data was not at the expected location, how can I get
> Oozie to recognize older triggers? That is, the case where a _SUCCESS
> file appears in a directory that is older than the one Oozie is
> currently watching.
>
> Will it automatically rescan partitions that previously timed out? Is
> there a way to do this in Oozie?
>
> One thing that occurs to me is that we could restart the whole
> coordinator job at some time in the past and have it run forward in
> time. DistCp is currently configured to only transfer new files. This
> is not ideal, though; I'd prefer a solution where Oozie knows which
> datasets have been processed and which have not.
>
> Thank you very much for your time.
--
Steve Hanna, PhD,
Senior Engineer
510.225.5337
Email: [email protected]
RiskIQ, Inc.
22 Battery St. 10th Floor
San Francisco, CA 94111