Thanks for the replies. I am seeing the following exception in the logs
during callback.
2014-06-28 15:14:46,048 INFO CallbackServlet:539 - USER[-] GROUP[-]
TOKEN[-] APP[-] JOB[0000000-140628150326547-oozie-hado-W]
ACTION[0000000-140628150326547-oozie-hado-W@person-action] callback for
action [0000000-140628150326547-oozie-hado-W@person-action]
2014-06-28 15:14:46,062 ERROR CompletedActionXCommand:536 - USER[-]
GROUP[-] TOKEN[] APP[-] JOB[0000000-140628150326547-oozie-hado-W]
ACTION[0000000-140628150326547-oozie-hado-W@person-action] XException,
org.apache.oozie.command.CommandException: E0800: Action it is not running
its in [PREP] state, action
[0000000-140628150326547-oozie-hado-W@person-action]
at
org.apache.oozie.command.wf.CompletedActionXCommand.eagerVerifyPrecondition(CompletedActionXCommand.java:85)
at org.apache.oozie.command.XCommand.call(XCommand.java:251)
at
org.apache.oozie.service.CallableQueueService$CallableWrapper.run(CallableQueueService.java:175)
at
java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:895)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:918)
at java.lang.Thread.run(Thread.java:662)
2014-06-28 15:14:46,062 WARN CallableQueueService$CallableWrapper:542 -
USER[-] GROUP[-] TOKEN[-] APP[-] JOB[-] ACTION[-] exception callable
[callback], E0800: Action it is not running its in [PREP] state, action
[0000000-140628150326547-oozie-hado-W@person-action]
org.apache.oozie.command.CommandException: E0800: Action it is not running
its in [PREP] state, action
[0000000-140628150326547-oozie-hado-W@person-action]
at
org.apache.oozie.command.wf.CompletedActionXCommand.eagerVerifyPrecondition(CompletedActionXCommand.java:85)
at org.apache.oozie.command.XCommand.call(XCommand.java:251)
at
org.apache.oozie.service.CallableQueueService$CallableWrapper.run(CallableQueueService.java:175)
at
java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:895)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:918)
at java.lang.Thread.run(Thread.java:662)
2014-06-28 15:14:46,076 WARN ActionStartXCommand:542 - USER[hadoop]
GROUP[-] TOKEN[] APP[main-job-wf] JOB[0000000-140628150326547-oozie-hado-W]
ACTION[0000000-140628150326547-oozie-hado-W@person-action]
[***0000000-140628150326547-oozie-hado-W@person-action***]Action
status=RUNNING
2014-06-28 15:14:46,076 WARN ActionStartXCommand:542 - USER[hadoop]
GROUP[-] TOKEN[] APP[main-job-wf] JOB[0000000-140628150326547-oozie-hado-W]
ACTION[0000000-140628150326547-oozie-hado-W@person-action]
[***0000000-140628150326547-oozie-hado-W@person-action***]Action updated in
DB!
2014-06-28 15:14:46,095 INFO ActionStartXCommand:539 - USER[hadoop]
GROUP[-] TOKEN[] APP[person-enrichment-job-wf]
JOB[0000004-140628150326547-oozie-hado-W]
ACTION[0000004-140628150326547-oozie-hado-W@:start:] Start action
[0000004-140628150326547-oozie-hado-W@:start:] with user-retry state :
userRetryCount [0], userRetryMax [0], userRetryInterval [10]
2014-06-28 15:14:46,095 WARN ActionStartXCommand:542 - USER[hadoop]
GROUP[-] TOKEN[] APP[person-enrichment-job-wf]
JOB[0000004-140628150326547-oozie-hado-W]
ACTION[0000004-140628150326547-oozie-hado-W@:start:]
[***0000004-140628150326547-oozie-hado-W@:start:***]Action status=DONE
2014-06-28 15:14:46,095 WARN ActionStartXCommand:542 - USER[hadoop]
GROUP[-] TOKEN[] APP[person-enrichment-job-wf]
JOB[0000004-140628150326547-oozie-hado-W]
ACTION[0000004-140628150326547-oozie-hado-W@:start:]
[***0000004-140628150326547-oozie-hado-W@:start:***]Action updated in DB!
However the pause doesn't seem to happen in the callback. It seems to have
something to do with the actions getting suspended as shown in the
following log entry.
2014-06-28 15:14:51,436 INFO JavaActionExecutor:539 - USER[hadoop]
GROUP[-] TOKEN[] APP[person-enrichment-job-wf]
JOB[0000004-140628150326547-oozie-hado-W]
ACTION[0000004-140628150326547-oozie-hado-W@personTelephoneEnrichmentJob]
checking action, external ID [job_201406260453_0554] status [RUNNING]
2014-06-28 15:14:51,439 WARN ActionStartXCommand:542 - USER[hadoop]
GROUP[-] TOKEN[] APP[person-enrichment-job-wf]
JOB[0000004-140628150326547-oozie-hado-W]
ACTION[0000004-140628150326547-oozie-hado-W@personTelephoneEnrichmentJob]
[***0000004-140628150326547-oozie-hado-W@personTelephoneEnrichmentJob***]Action
status=RUNNING
2014-06-28 15:14:51,440 WARN ActionStartXCommand:542 - USER[hadoop]
GROUP[-] TOKEN[] APP[person-enrichment-job-wf]
JOB[0000004-140628150326547-oozie-hado-W]
ACTION[0000004-140628150326547-oozie-hado-W@personTelephoneEnrichmentJob]
[***0000004-140628150326547-oozie-hado-W@personTelephoneEnrichmentJob***]Action
updated in DB!
2014-06-28 15:15:42,976 WARN ActionCheckXCommand:542 - USER[hadoop]
GROUP[-] TOKEN[] APP[person-enrichment-job-wf]
JOB[0000011-140515062648271-oozie-hado-W]
ACTION[0000011-140515062648271-oozie-hado-W@personEnrichmentJob] E0818:
Action [0000011-140515062648271-oozie-hado-W@personEnrichmentJob] status is
running but WF Job [0000011-140515062648271-oozie-hado-W] status is
[SUSPENDED]. Expected status is RUNNING., Error Code: E0818
Looking at the Hadoop jobtracker UI these jobs completed successfully
at Sat Jun 28 15:15:52 UTC 2014 which seems to be right around the
suspended messages. These actions stay suspended for another ten minutes
before kicking off the next set of actions.
2014-06-28 15:25:43,603 INFO JavaActionExecutor:539 - USER[hadoop]
GROUP[-] TOKEN[] APP[person-enrichment-job-wf]
JOB[0000004-140628150326547-oozie-hado-W]
ACTION[0000004-140628150326547-oozie-hado-W@personEmailEnrichmentJob]
action completed, external ID [job_201406260453_0553]
2014-06-28 15:25:43,607 INFO JavaActionExecutor:539 - USER[hadoop]
GROUP[-] TOKEN[] APP[person-enrichment-job-wf]
JOB[0000004-140628150326547-oozie-hado-W]
ACTION[0000004-140628150326547-oozie-hado-W@personEmailEnrichmentJob]
action produced output
Is this still related to the callback issue you mention or is this
something else with the suspended actions? FWIW I am able to connect to
the Oozie servers from all the nodes in the Hadoop cluster.
Thanks for the help!
Jonathan
On Fri, Jun 27, 2014 at 11:29 AM, Amit Patil <[email protected]>
wrote:
> I think the callback url is not reachable by all the nodes on your hadoop
> cluster. Since, the callback url fails, there is no way for the workers to
> notify the oozie server that the action has finished execution. After 10
> mins, the oozie server polls the worker node to get the status. Thats why
> you are seeing 10 mins pauses.
> Set the following variable in oozie-env.sh to the ipaddres of the server
> instead of the hostname and restart the oozie server
> export OOZIE_HTTP_HOSTNAME=<server ip>
>
>
>
>
> On Fri, Jun 27, 2014 at 10:19 AM, Jonathan Hodges <[email protected]>
> wrote:
>
> > Hi,
> >
> > We have noticed some pauses between execution of action steps in our
> Oozie
> > workflows. A job may take a couple minutes yet the next step isn't
> > triggered for sometimes ten minutes. Is there a way to control this
> pause?
> > We have a set of jobs that when executed at a command-line runs for
> about
> > 45 mins. The same set of jobs in Oozie is over 3 hours.
> >
> > We tried setting the following properties, but we aren't using
> coordinators
> > so it didn't help.
> >
> > oozie.service.CoordMaterializeTriggerService.lookup.interval=30
> > oozie.service.CoordMaterializeTriggerService.materialization.window=7200
> >
> > Is there maybe similar properties at the workflow level we can set?
> >
> > Thanks so much,
> >
> > Jonathan
> >
>