Sorry I didn't get back to you earlier. I'm glad you were able to figure out what was blocking the callback URL on your own though!
On Mon, Jun 30, 2014 at 1:34 PM, Jonathan Hodges <[email protected]> wrote: > Okay the callback URL was definitely the problem. The ec2 hostname of the > Oozie server wasn't resolvable which was causing the issue. Since we have > multiple Oozie server nodes we will set OOZIE_HTTP_HOSTNAME to use a > VIP/ELB in the oozie-sys.sh. > > Thanks so much for the help! > > > > On Mon, Jun 30, 2014 at 12:04 PM, Jonathan Hodges <[email protected]> > wrote: > > > Is there a way I can determine the exact callback URL to verify > > connectivity from each of the Hadoop nodes? I guess maybe the actions > are > > getting suspended until callback URL is called or manual check every ten > > minutes as mentioned. > > > > > > > > On Sat, Jun 28, 2014 at 9:50 AM, Jonathan Hodges <[email protected]> > > wrote: > > > >> Thanks for the replies. I am seeing the following exception in the logs > >> during callback. > >> > >> > >> 2014-06-28 15:14:46,048 INFO CallbackServlet:539 - USER[-] GROUP[-] > >> TOKEN[-] APP[-] JOB[0000000-140628150326547-oozie-hado-W] > >> ACTION[0000000-140628150326547-oozie-hado-W@person-action] callback for > >> action [0000000-140628150326547-oozie-hado-W@person-action] > >> 2014-06-28 15:14:46,062 ERROR CompletedActionXCommand:536 - USER[-] > >> GROUP[-] TOKEN[] APP[-] JOB[0000000-140628150326547-oozie-hado-W] > >> ACTION[0000000-140628150326547-oozie-hado-W@person-action] XException, > >> org.apache.oozie.command.CommandException: E0800: Action it is not > >> running its in [PREP] state, action > >> [0000000-140628150326547-oozie-hado-W@person-action] > >> at > >> > org.apache.oozie.command.wf.CompletedActionXCommand.eagerVerifyPrecondition(CompletedActionXCommand.java:85) > >> at org.apache.oozie.command.XCommand.call(XCommand.java:251) > >> at > >> > org.apache.oozie.service.CallableQueueService$CallableWrapper.run(CallableQueueService.java:175) > >> at > >> > java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:895) > >> at > >> > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:918) > >> at java.lang.Thread.run(Thread.java:662) > >> 2014-06-28 15:14:46,062 WARN CallableQueueService$CallableWrapper:542 - > >> USER[-] GROUP[-] TOKEN[-] APP[-] JOB[-] ACTION[-] exception callable > >> [callback], E0800: Action it is not running its in [PREP] state, action > >> [0000000-140628150326547-oozie-hado-W@person-action] > >> org.apache.oozie.command.CommandException: E0800: Action it is not > >> running its in [PREP] state, action > >> [0000000-140628150326547-oozie-hado-W@person-action] > >> at > >> > org.apache.oozie.command.wf.CompletedActionXCommand.eagerVerifyPrecondition(CompletedActionXCommand.java:85) > >> at org.apache.oozie.command.XCommand.call(XCommand.java:251) > >> at > >> > org.apache.oozie.service.CallableQueueService$CallableWrapper.run(CallableQueueService.java:175) > >> at > >> > java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:895) > >> at > >> > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:918) > >> at java.lang.Thread.run(Thread.java:662) > >> 2014-06-28 15:14:46,076 WARN ActionStartXCommand:542 - USER[hadoop] > >> GROUP[-] TOKEN[] APP[main-job-wf] > JOB[0000000-140628150326547-oozie-hado-W] > >> ACTION[0000000-140628150326547-oozie-hado-W@person-action] > >> [***0000000-140628150326547-oozie-hado-W@person-action***]Action > >> status=RUNNING > >> 2014-06-28 15:14:46,076 WARN ActionStartXCommand:542 - USER[hadoop] > >> GROUP[-] TOKEN[] APP[main-job-wf] > JOB[0000000-140628150326547-oozie-hado-W] > >> ACTION[0000000-140628150326547-oozie-hado-W@person-action] > >> [***0000000-140628150326547-oozie-hado-W@person-action***]Action > updated > >> in DB! > >> 2014-06-28 15:14:46,095 INFO ActionStartXCommand:539 - USER[hadoop] > >> GROUP[-] TOKEN[] APP[person-enrichment-job-wf] > >> JOB[0000004-140628150326547-oozie-hado-W] > >> ACTION[0000004-140628150326547-oozie-hado-W@:start:] Start action > >> [0000004-140628150326547-oozie-hado-W@:start:] with user-retry state : > >> userRetryCount [0], userRetryMax [0], userRetryInterval [10] > >> 2014-06-28 15:14:46,095 WARN ActionStartXCommand:542 - USER[hadoop] > >> GROUP[-] TOKEN[] APP[person-enrichment-job-wf] > >> JOB[0000004-140628150326547-oozie-hado-W] > >> ACTION[0000004-140628150326547-oozie-hado-W@:start:] > >> [***0000004-140628150326547-oozie-hado-W@:start:***]Action status=DONE > >> 2014-06-28 15:14:46,095 WARN ActionStartXCommand:542 - USER[hadoop] > >> GROUP[-] TOKEN[] APP[person-enrichment-job-wf] > >> JOB[0000004-140628150326547-oozie-hado-W] > >> ACTION[0000004-140628150326547-oozie-hado-W@:start:] > >> [***0000004-140628150326547-oozie-hado-W@:start:***]Action updated in > DB! > >> > >> > >> > >> However the pause doesn't seem to happen in the callback. It seems to > >> have something to do with the actions getting suspended as shown in the > >> following log entry. > >> > >> > >> > >> 2014-06-28 15:14:51,436 INFO JavaActionExecutor:539 - USER[hadoop] > >> GROUP[-] TOKEN[] APP[person-enrichment-job-wf] > >> JOB[0000004-140628150326547-oozie-hado-W] > >> ACTION[0000004-140628150326547-oozie-hado-W@personTelephoneEnrichmentJob > ] > >> checking action, external ID [job_201406260453_0554] status [RUNNING] > >> 2014-06-28 15:14:51,439 WARN ActionStartXCommand:542 - USER[hadoop] > >> GROUP[-] TOKEN[] APP[person-enrichment-job-wf] > >> JOB[0000004-140628150326547-oozie-hado-W] > >> ACTION[0000004-140628150326547-oozie-hado-W@personTelephoneEnrichmentJob > ] > >> [***0000004-140628150326547-oozie-hado-W@personTelephoneEnrichmentJob > ***]Action > >> status=RUNNING > >> 2014-06-28 15:14:51,440 WARN ActionStartXCommand:542 - USER[hadoop] > >> GROUP[-] TOKEN[] APP[person-enrichment-job-wf] > >> JOB[0000004-140628150326547-oozie-hado-W] > >> ACTION[0000004-140628150326547-oozie-hado-W@personTelephoneEnrichmentJob > ] > >> [***0000004-140628150326547-oozie-hado-W@personTelephoneEnrichmentJob > ***]Action > >> updated in DB! > >> 2014-06-28 15:15:42,976 WARN ActionCheckXCommand:542 - USER[hadoop] > >> GROUP[-] TOKEN[] APP[person-enrichment-job-wf] > >> JOB[0000011-140515062648271-oozie-hado-W] > >> ACTION[0000011-140515062648271-oozie-hado-W@personEnrichmentJob] E0818: > >> Action [0000011-140515062648271-oozie-hado-W@personEnrichmentJob] > status > >> is running but WF Job [0000011-140515062648271-oozie-hado-W] status is > >> [SUSPENDED]. Expected status is RUNNING., Error Code: E0818 > >> > >> > >> Looking at the Hadoop jobtracker UI these jobs completed successfully > >> at Sat Jun 28 15:15:52 UTC 2014 which seems to be right around the > >> suspended messages. These actions stay suspended for another ten > minutes > >> before kicking off the next set of actions. > >> > >> > >> 2014-06-28 15:25:43,603 INFO JavaActionExecutor:539 - USER[hadoop] > >> GROUP[-] TOKEN[] APP[person-enrichment-job-wf] > >> JOB[0000004-140628150326547-oozie-hado-W] > >> ACTION[0000004-140628150326547-oozie-hado-W@personEmailEnrichmentJob] > >> action completed, external ID [job_201406260453_0553] > >> 2014-06-28 15:25:43,607 INFO JavaActionExecutor:539 - USER[hadoop] > >> GROUP[-] TOKEN[] APP[person-enrichment-job-wf] > >> JOB[0000004-140628150326547-oozie-hado-W] > >> ACTION[0000004-140628150326547-oozie-hado-W@personEmailEnrichmentJob] > >> action produced output > >> > >> > >> > >> Is this still related to the callback issue you mention or is this > >> something else with the suspended actions? FWIW I am able to connect to > >> the Oozie servers from all the nodes in the Hadoop cluster. > >> > >> Thanks for the help! > >> > >> Jonathan > >> > >> > >> > >> > >> > >> On Fri, Jun 27, 2014 at 11:29 AM, Amit Patil < > [email protected] > >> > wrote: > >> > >>> I think the callback url is not reachable by all the nodes on your > hadoop > >>> cluster. Since, the callback url fails, there is no way for the workers > >>> to > >>> notify the oozie server that the action has finished execution. After > 10 > >>> mins, the oozie server polls the worker node to get the status. Thats > why > >>> you are seeing 10 mins pauses. > >>> Set the following variable in oozie-env.sh to the ipaddres of the > server > >>> instead of the hostname and restart the oozie server > >>> export OOZIE_HTTP_HOSTNAME=<server ip> > >>> > >>> > >>> > >>> > >>> On Fri, Jun 27, 2014 at 10:19 AM, Jonathan Hodges <[email protected]> > >>> wrote: > >>> > >>> > Hi, > >>> > > >>> > We have noticed some pauses between execution of action steps in our > >>> Oozie > >>> > workflows. A job may take a couple minutes yet the next step isn't > >>> > triggered for sometimes ten minutes. Is there a way to control this > >>> pause? > >>> > We have a set of jobs that when executed at a command-line runs for > >>> about > >>> > 45 mins. The same set of jobs in Oozie is over 3 hours. > >>> > > >>> > We tried setting the following properties, but we aren't using > >>> coordinators > >>> > so it didn't help. > >>> > > >>> > oozie.service.CoordMaterializeTriggerService.lookup.interval=30 > >>> > > >>> > oozie.service.CoordMaterializeTriggerService.materialization.window=7200 > >>> > > >>> > Is there maybe similar properties at the workflow level we can set? > >>> > > >>> > Thanks so much, > >>> > > >>> > Jonathan > >>> > > >>> > >> > >> > > >
