Okay the callback URL was definitely the problem. The ec2 hostname of the Oozie server wasn't resolvable which was causing the issue. Since we have multiple Oozie server nodes we will set OOZIE_HTTP_HOSTNAME to use a VIP/ELB in the oozie-sys.sh.
Thanks so much for the help! On Mon, Jun 30, 2014 at 12:04 PM, Jonathan Hodges <[email protected]> wrote: > Is there a way I can determine the exact callback URL to verify > connectivity from each of the Hadoop nodes? I guess maybe the actions are > getting suspended until callback URL is called or manual check every ten > minutes as mentioned. > > > > On Sat, Jun 28, 2014 at 9:50 AM, Jonathan Hodges <[email protected]> > wrote: > >> Thanks for the replies. I am seeing the following exception in the logs >> during callback. >> >> >> 2014-06-28 15:14:46,048 INFO CallbackServlet:539 - USER[-] GROUP[-] >> TOKEN[-] APP[-] JOB[0000000-140628150326547-oozie-hado-W] >> ACTION[0000000-140628150326547-oozie-hado-W@person-action] callback for >> action [0000000-140628150326547-oozie-hado-W@person-action] >> 2014-06-28 15:14:46,062 ERROR CompletedActionXCommand:536 - USER[-] >> GROUP[-] TOKEN[] APP[-] JOB[0000000-140628150326547-oozie-hado-W] >> ACTION[0000000-140628150326547-oozie-hado-W@person-action] XException, >> org.apache.oozie.command.CommandException: E0800: Action it is not >> running its in [PREP] state, action >> [0000000-140628150326547-oozie-hado-W@person-action] >> at >> org.apache.oozie.command.wf.CompletedActionXCommand.eagerVerifyPrecondition(CompletedActionXCommand.java:85) >> at org.apache.oozie.command.XCommand.call(XCommand.java:251) >> at >> org.apache.oozie.service.CallableQueueService$CallableWrapper.run(CallableQueueService.java:175) >> at >> java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:895) >> at >> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:918) >> at java.lang.Thread.run(Thread.java:662) >> 2014-06-28 15:14:46,062 WARN CallableQueueService$CallableWrapper:542 - >> USER[-] GROUP[-] TOKEN[-] APP[-] JOB[-] ACTION[-] exception callable >> [callback], E0800: Action it is not running its in [PREP] state, action >> [0000000-140628150326547-oozie-hado-W@person-action] >> org.apache.oozie.command.CommandException: E0800: Action it is not >> running its in [PREP] state, action >> [0000000-140628150326547-oozie-hado-W@person-action] >> at >> org.apache.oozie.command.wf.CompletedActionXCommand.eagerVerifyPrecondition(CompletedActionXCommand.java:85) >> at org.apache.oozie.command.XCommand.call(XCommand.java:251) >> at >> org.apache.oozie.service.CallableQueueService$CallableWrapper.run(CallableQueueService.java:175) >> at >> java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:895) >> at >> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:918) >> at java.lang.Thread.run(Thread.java:662) >> 2014-06-28 15:14:46,076 WARN ActionStartXCommand:542 - USER[hadoop] >> GROUP[-] TOKEN[] APP[main-job-wf] JOB[0000000-140628150326547-oozie-hado-W] >> ACTION[0000000-140628150326547-oozie-hado-W@person-action] >> [***0000000-140628150326547-oozie-hado-W@person-action***]Action >> status=RUNNING >> 2014-06-28 15:14:46,076 WARN ActionStartXCommand:542 - USER[hadoop] >> GROUP[-] TOKEN[] APP[main-job-wf] JOB[0000000-140628150326547-oozie-hado-W] >> ACTION[0000000-140628150326547-oozie-hado-W@person-action] >> [***0000000-140628150326547-oozie-hado-W@person-action***]Action updated >> in DB! >> 2014-06-28 15:14:46,095 INFO ActionStartXCommand:539 - USER[hadoop] >> GROUP[-] TOKEN[] APP[person-enrichment-job-wf] >> JOB[0000004-140628150326547-oozie-hado-W] >> ACTION[0000004-140628150326547-oozie-hado-W@:start:] Start action >> [0000004-140628150326547-oozie-hado-W@:start:] with user-retry state : >> userRetryCount [0], userRetryMax [0], userRetryInterval [10] >> 2014-06-28 15:14:46,095 WARN ActionStartXCommand:542 - USER[hadoop] >> GROUP[-] TOKEN[] APP[person-enrichment-job-wf] >> JOB[0000004-140628150326547-oozie-hado-W] >> ACTION[0000004-140628150326547-oozie-hado-W@:start:] >> [***0000004-140628150326547-oozie-hado-W@:start:***]Action status=DONE >> 2014-06-28 15:14:46,095 WARN ActionStartXCommand:542 - USER[hadoop] >> GROUP[-] TOKEN[] APP[person-enrichment-job-wf] >> JOB[0000004-140628150326547-oozie-hado-W] >> ACTION[0000004-140628150326547-oozie-hado-W@:start:] >> [***0000004-140628150326547-oozie-hado-W@:start:***]Action updated in DB! >> >> >> >> However the pause doesn't seem to happen in the callback. It seems to >> have something to do with the actions getting suspended as shown in the >> following log entry. >> >> >> >> 2014-06-28 15:14:51,436 INFO JavaActionExecutor:539 - USER[hadoop] >> GROUP[-] TOKEN[] APP[person-enrichment-job-wf] >> JOB[0000004-140628150326547-oozie-hado-W] >> ACTION[0000004-140628150326547-oozie-hado-W@personTelephoneEnrichmentJob] >> checking action, external ID [job_201406260453_0554] status [RUNNING] >> 2014-06-28 15:14:51,439 WARN ActionStartXCommand:542 - USER[hadoop] >> GROUP[-] TOKEN[] APP[person-enrichment-job-wf] >> JOB[0000004-140628150326547-oozie-hado-W] >> ACTION[0000004-140628150326547-oozie-hado-W@personTelephoneEnrichmentJob] >> [***0000004-140628150326547-oozie-hado-W@personTelephoneEnrichmentJob***]Action >> status=RUNNING >> 2014-06-28 15:14:51,440 WARN ActionStartXCommand:542 - USER[hadoop] >> GROUP[-] TOKEN[] APP[person-enrichment-job-wf] >> JOB[0000004-140628150326547-oozie-hado-W] >> ACTION[0000004-140628150326547-oozie-hado-W@personTelephoneEnrichmentJob] >> [***0000004-140628150326547-oozie-hado-W@personTelephoneEnrichmentJob***]Action >> updated in DB! >> 2014-06-28 15:15:42,976 WARN ActionCheckXCommand:542 - USER[hadoop] >> GROUP[-] TOKEN[] APP[person-enrichment-job-wf] >> JOB[0000011-140515062648271-oozie-hado-W] >> ACTION[0000011-140515062648271-oozie-hado-W@personEnrichmentJob] E0818: >> Action [0000011-140515062648271-oozie-hado-W@personEnrichmentJob] status >> is running but WF Job [0000011-140515062648271-oozie-hado-W] status is >> [SUSPENDED]. Expected status is RUNNING., Error Code: E0818 >> >> >> Looking at the Hadoop jobtracker UI these jobs completed successfully >> at Sat Jun 28 15:15:52 UTC 2014 which seems to be right around the >> suspended messages. These actions stay suspended for another ten minutes >> before kicking off the next set of actions. >> >> >> 2014-06-28 15:25:43,603 INFO JavaActionExecutor:539 - USER[hadoop] >> GROUP[-] TOKEN[] APP[person-enrichment-job-wf] >> JOB[0000004-140628150326547-oozie-hado-W] >> ACTION[0000004-140628150326547-oozie-hado-W@personEmailEnrichmentJob] >> action completed, external ID [job_201406260453_0553] >> 2014-06-28 15:25:43,607 INFO JavaActionExecutor:539 - USER[hadoop] >> GROUP[-] TOKEN[] APP[person-enrichment-job-wf] >> JOB[0000004-140628150326547-oozie-hado-W] >> ACTION[0000004-140628150326547-oozie-hado-W@personEmailEnrichmentJob] >> action produced output >> >> >> >> Is this still related to the callback issue you mention or is this >> something else with the suspended actions? FWIW I am able to connect to >> the Oozie servers from all the nodes in the Hadoop cluster. >> >> Thanks for the help! >> >> Jonathan >> >> >> >> >> >> On Fri, Jun 27, 2014 at 11:29 AM, Amit Patil <[email protected] >> > wrote: >> >>> I think the callback url is not reachable by all the nodes on your hadoop >>> cluster. Since, the callback url fails, there is no way for the workers >>> to >>> notify the oozie server that the action has finished execution. After 10 >>> mins, the oozie server polls the worker node to get the status. Thats why >>> you are seeing 10 mins pauses. >>> Set the following variable in oozie-env.sh to the ipaddres of the server >>> instead of the hostname and restart the oozie server >>> export OOZIE_HTTP_HOSTNAME=<server ip> >>> >>> >>> >>> >>> On Fri, Jun 27, 2014 at 10:19 AM, Jonathan Hodges <[email protected]> >>> wrote: >>> >>> > Hi, >>> > >>> > We have noticed some pauses between execution of action steps in our >>> Oozie >>> > workflows. A job may take a couple minutes yet the next step isn't >>> > triggered for sometimes ten minutes. Is there a way to control this >>> pause? >>> > We have a set of jobs that when executed at a command-line runs for >>> about >>> > 45 mins. The same set of jobs in Oozie is over 3 hours. >>> > >>> > We tried setting the following properties, but we aren't using >>> coordinators >>> > so it didn't help. >>> > >>> > oozie.service.CoordMaterializeTriggerService.lookup.interval=30 >>> > >>> oozie.service.CoordMaterializeTriggerService.materialization.window=7200 >>> > >>> > Is there maybe similar properties at the workflow level we can set? >>> > >>> > Thanks so much, >>> > >>> > Jonathan >>> > >>> >> >> >
