Sorry I didn't get back to you earlier.  I'm glad you were able to figure
out what was blocking the callback URL on your own though!
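
For anyone who finds this thread later, here is a rough sketch of how the connectivity check could be done from each Hadoop node. It only tests that the Oozie hostname resolves and that the port accepts a TCP connection; it does not call the callback itself. The function name and the example URL are my own placeholders, and I am assuming the callback lives under the Oozie base URL on the default port 11000, so adjust both for your setup.

```python
import socket
import sys
from urllib.parse import urlparse


def check_callback_endpoint(url, timeout=5.0):
    """Return (resolved_ip, reachable) for the host and port named in url.

    resolved_ip is None when the hostname cannot be resolved at all,
    which is the failure mode described in this thread.
    """
    parsed = urlparse(url)
    host = parsed.hostname
    if host is None:
        raise ValueError("URL has no hostname: %r" % url)
    # Fall back to the scheme's standard port when none is given.
    port = parsed.port or (443 if parsed.scheme == "https" else 80)

    try:
        ip = socket.gethostbyname(host)
    except socket.gaierror:
        return None, False

    try:
        # A plain TCP connect; nothing is sent to the server.
        with socket.create_connection((ip, port), timeout=timeout):
            return ip, True
    except OSError:
        return ip, False


if __name__ == "__main__":
    # Hypothetical usage:
    #   python check_callback.py http://oozie-host.example.com:11000/oozie/callback
    for target in sys.argv[1:]:
        ip, ok = check_callback_endpoint(target)
        print(target, "resolved to", ip, "reachable" if ok else "NOT reachable")
```

Running it on a worker node with the hostname the Oozie server advertises should show quickly whether the problem is name resolution (no IP returned) or a blocked port (IP returned but not reachable).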



On Mon, Jun 30, 2014 at 1:34 PM, Jonathan Hodges <[email protected]> wrote:

> Okay, the callback URL was definitely the problem.  The EC2 hostname of the
> Oozie server wasn't resolvable, which was causing the issue.  Since we have
> multiple Oozie server nodes, we will set OOZIE_HTTP_HOSTNAME to use a
> VIP/ELB in the oozie-sys.sh.
>
> Thanks so much for the help!
>
>
>
> On Mon, Jun 30, 2014 at 12:04 PM, Jonathan Hodges <[email protected]>
> wrote:
>
> > Is there a way I can determine the exact callback URL to verify
> > connectivity from each of the Hadoop nodes?  I guess maybe the actions are
> > getting suspended until the callback URL is called or the manual check
> > runs every ten minutes as mentioned.
> >
> >
> >
> > On Sat, Jun 28, 2014 at 9:50 AM, Jonathan Hodges <[email protected]>
> > wrote:
> >
> >> Thanks for the replies.  I am seeing the following exception in the logs
> >> during callback.
> >>
> >>
> >> 2014-06-28 15:14:46,048  INFO CallbackServlet:539 - USER[-] GROUP[-]
> >> TOKEN[-] APP[-] JOB[0000000-140628150326547-oozie-hado-W]
> >> ACTION[0000000-140628150326547-oozie-hado-W@person-action] callback for
> >> action [0000000-140628150326547-oozie-hado-W@person-action]
> >> 2014-06-28 15:14:46,062 ERROR CompletedActionXCommand:536 - USER[-]
> >> GROUP[-] TOKEN[] APP[-] JOB[0000000-140628150326547-oozie-hado-W]
> >> ACTION[0000000-140628150326547-oozie-hado-W@person-action] XException,
> >> org.apache.oozie.command.CommandException: E0800: Action it is not
> >> running its in [PREP] state, action
> >> [0000000-140628150326547-oozie-hado-W@person-action]
> >>         at org.apache.oozie.command.wf.CompletedActionXCommand.eagerVerifyPrecondition(CompletedActionXCommand.java:85)
> >>         at org.apache.oozie.command.XCommand.call(XCommand.java:251)
> >>         at org.apache.oozie.service.CallableQueueService$CallableWrapper.run(CallableQueueService.java:175)
> >>         at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:895)
> >>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:918)
> >>         at java.lang.Thread.run(Thread.java:662)
> >> 2014-06-28 15:14:46,062  WARN CallableQueueService$CallableWrapper:542 -
> >> USER[-] GROUP[-] TOKEN[-] APP[-] JOB[-] ACTION[-] exception callable
> >> [callback], E0800: Action it is not running its in [PREP] state, action
> >> [0000000-140628150326547-oozie-hado-W@person-action]
> >> org.apache.oozie.command.CommandException: E0800: Action it is not
> >> running its in [PREP] state, action
> >> [0000000-140628150326547-oozie-hado-W@person-action]
> >>         at org.apache.oozie.command.wf.CompletedActionXCommand.eagerVerifyPrecondition(CompletedActionXCommand.java:85)
> >>         at org.apache.oozie.command.XCommand.call(XCommand.java:251)
> >>         at org.apache.oozie.service.CallableQueueService$CallableWrapper.run(CallableQueueService.java:175)
> >>         at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:895)
> >>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:918)
> >>         at java.lang.Thread.run(Thread.java:662)
> >> 2014-06-28 15:14:46,076  WARN ActionStartXCommand:542 - USER[hadoop]
> >> GROUP[-] TOKEN[] APP[main-job-wf]
> >> JOB[0000000-140628150326547-oozie-hado-W]
> >> ACTION[0000000-140628150326547-oozie-hado-W@person-action]
> >> [***0000000-140628150326547-oozie-hado-W@person-action***]Action
> >> status=RUNNING
> >> 2014-06-28 15:14:46,076  WARN ActionStartXCommand:542 - USER[hadoop]
> >> GROUP[-] TOKEN[] APP[main-job-wf]
> >> JOB[0000000-140628150326547-oozie-hado-W]
> >> ACTION[0000000-140628150326547-oozie-hado-W@person-action]
> >> [***0000000-140628150326547-oozie-hado-W@person-action***]Action
> >> updated in DB!
> >> 2014-06-28 15:14:46,095  INFO ActionStartXCommand:539 - USER[hadoop]
> >> GROUP[-] TOKEN[] APP[person-enrichment-job-wf]
> >> JOB[0000004-140628150326547-oozie-hado-W]
> >> ACTION[0000004-140628150326547-oozie-hado-W@:start:] Start action
> >> [0000004-140628150326547-oozie-hado-W@:start:] with user-retry state :
> >> userRetryCount [0], userRetryMax [0], userRetryInterval [10]
> >> 2014-06-28 15:14:46,095  WARN ActionStartXCommand:542 - USER[hadoop]
> >> GROUP[-] TOKEN[] APP[person-enrichment-job-wf]
> >> JOB[0000004-140628150326547-oozie-hado-W]
> >> ACTION[0000004-140628150326547-oozie-hado-W@:start:]
> >> [***0000004-140628150326547-oozie-hado-W@:start:***]Action status=DONE
> >> 2014-06-28 15:14:46,095  WARN ActionStartXCommand:542 - USER[hadoop]
> >> GROUP[-] TOKEN[] APP[person-enrichment-job-wf]
> >> JOB[0000004-140628150326547-oozie-hado-W]
> >> ACTION[0000004-140628150326547-oozie-hado-W@:start:]
> >> [***0000004-140628150326547-oozie-hado-W@:start:***]Action updated in DB!
> >>
> >>
> >>
> >> However the pause doesn't seem to happen in the callback.  It seems to
> >> have something to do with the actions getting suspended as shown in the
> >> following log entry.
> >>
> >>
> >>
> >> 2014-06-28 15:14:51,436  INFO JavaActionExecutor:539 - USER[hadoop]
> >> GROUP[-] TOKEN[] APP[person-enrichment-job-wf]
> >> JOB[0000004-140628150326547-oozie-hado-W]
> >> ACTION[0000004-140628150326547-oozie-hado-W@personTelephoneEnrichmentJob]
> >> checking action, external ID [job_201406260453_0554] status [RUNNING]
> >> 2014-06-28 15:14:51,439  WARN ActionStartXCommand:542 - USER[hadoop]
> >> GROUP[-] TOKEN[] APP[person-enrichment-job-wf]
> >> JOB[0000004-140628150326547-oozie-hado-W]
> >> ACTION[0000004-140628150326547-oozie-hado-W@personTelephoneEnrichmentJob]
> >> [***0000004-140628150326547-oozie-hado-W@personTelephoneEnrichmentJob***]Action
> >> status=RUNNING
> >> 2014-06-28 15:14:51,440  WARN ActionStartXCommand:542 - USER[hadoop]
> >> GROUP[-] TOKEN[] APP[person-enrichment-job-wf]
> >> JOB[0000004-140628150326547-oozie-hado-W]
> >> ACTION[0000004-140628150326547-oozie-hado-W@personTelephoneEnrichmentJob]
> >> [***0000004-140628150326547-oozie-hado-W@personTelephoneEnrichmentJob***]Action
> >> updated in DB!
> >> 2014-06-28 15:15:42,976  WARN ActionCheckXCommand:542 - USER[hadoop]
> >> GROUP[-] TOKEN[] APP[person-enrichment-job-wf]
> >> JOB[0000011-140515062648271-oozie-hado-W]
> >> ACTION[0000011-140515062648271-oozie-hado-W@personEnrichmentJob] E0818:
> >> Action [0000011-140515062648271-oozie-hado-W@personEnrichmentJob] status
> >> is running but WF Job [0000011-140515062648271-oozie-hado-W] status is
> >> [SUSPENDED]. Expected status is RUNNING., Error Code: E0818
> >>
> >>
> >> Looking at the Hadoop jobtracker UI, these jobs completed successfully
> >> at Sat Jun 28 15:15:52 UTC 2014, which seems to be right around the
> >> suspended messages.  These actions stay suspended for another ten minutes
> >> before kicking off the next set of actions.
> >>
> >>
> >> 2014-06-28 15:25:43,603  INFO JavaActionExecutor:539 - USER[hadoop]
> >> GROUP[-] TOKEN[] APP[person-enrichment-job-wf]
> >> JOB[0000004-140628150326547-oozie-hado-W]
> >> ACTION[0000004-140628150326547-oozie-hado-W@personEmailEnrichmentJob]
> >> action completed, external ID [job_201406260453_0553]
> >> 2014-06-28 15:25:43,607  INFO JavaActionExecutor:539 - USER[hadoop]
> >> GROUP[-] TOKEN[] APP[person-enrichment-job-wf]
> >> JOB[0000004-140628150326547-oozie-hado-W]
> >> ACTION[0000004-140628150326547-oozie-hado-W@personEmailEnrichmentJob]
> >> action produced output
> >>
> >>
> >>
> >> Is this still related to the callback issue you mention or is this
> >> something else with the suspended actions?  FWIW I am able to connect to
> >> the Oozie servers from all the nodes in the Hadoop cluster.
> >>
> >> Thanks for the help!
> >>
> >> Jonathan
> >>
> >>
> >>
> >>
> >>
> >> On Fri, Jun 27, 2014 at 11:29 AM, Amit Patil <[email protected]> wrote:
> >>
> >>> I think the callback URL is not reachable by all the nodes on your Hadoop
> >>> cluster. Since the callback URL fails, there is no way for the workers to
> >>> notify the Oozie server that the action has finished execution. After 10
> >>> mins, the Oozie server polls the worker node to get the status. That's why
> >>> you are seeing 10-minute pauses.
> >>> Set the following variable in oozie-env.sh to the IP address of the server
> >>> instead of the hostname and restart the Oozie server:
> >>> export OOZIE_HTTP_HOSTNAME=<server ip>
> >>>
> >>>
> >>>
> >>>
> >>> On Fri, Jun 27, 2014 at 10:19 AM, Jonathan Hodges <[email protected]>
> >>> wrote:
> >>>
> >>> > Hi,
> >>> >
> >>> > We have noticed some pauses between execution of action steps in our
> >>> > Oozie workflows.  A job may take a couple of minutes, yet the next step
> >>> > sometimes isn't triggered for ten minutes.  Is there a way to control
> >>> > this pause?  We have a set of jobs that, when executed at a command
> >>> > line, runs for about 45 mins.  The same set of jobs in Oozie takes over
> >>> > 3 hours.
> >>> >
> >>> > We tried setting the following properties, but we aren't using
> >>> > coordinators, so it didn't help.
> >>> >
> >>> > oozie.service.CoordMaterializeTriggerService.lookup.interval=30
> >>> >
> >>> > oozie.service.CoordMaterializeTriggerService.materialization.window=7200
> >>> >
> >>> > Are there maybe similar properties at the workflow level we can set?
> >>> >
> >>> > Thanks so much,
> >>> >
> >>> > Jonathan
> >>> >
> >>>
> >>
> >>
> >
>
