Is there a way I can determine the exact callback URL to verify connectivity from each of the Hadoop nodes? I guess maybe the actions are getting suspended until callback URL is called or manual check every ten minutes as mentioned.
On Sat, Jun 28, 2014 at 9:50 AM, Jonathan Hodges <[email protected]> wrote: > Thanks for the replies. I am seeing the following exception in the logs > during callback. > > > 2014-06-28 15:14:46,048 INFO CallbackServlet:539 - USER[-] GROUP[-] > TOKEN[-] APP[-] JOB[0000000-140628150326547-oozie-hado-W] > ACTION[0000000-140628150326547-oozie-hado-W@person-action] callback for > action [0000000-140628150326547-oozie-hado-W@person-action] > 2014-06-28 15:14:46,062 ERROR CompletedActionXCommand:536 - USER[-] > GROUP[-] TOKEN[] APP[-] JOB[0000000-140628150326547-oozie-hado-W] > ACTION[0000000-140628150326547-oozie-hado-W@person-action] XException, > org.apache.oozie.command.CommandException: E0800: Action it is not running > its in [PREP] state, action > [0000000-140628150326547-oozie-hado-W@person-action] > at > org.apache.oozie.command.wf.CompletedActionXCommand.eagerVerifyPrecondition(CompletedActionXCommand.java:85) > at org.apache.oozie.command.XCommand.call(XCommand.java:251) > at > org.apache.oozie.service.CallableQueueService$CallableWrapper.run(CallableQueueService.java:175) > at > java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:895) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:918) > at java.lang.Thread.run(Thread.java:662) > 2014-06-28 15:14:46,062 WARN CallableQueueService$CallableWrapper:542 - > USER[-] GROUP[-] TOKEN[-] APP[-] JOB[-] ACTION[-] exception callable > [callback], E0800: Action it is not running its in [PREP] state, action > [0000000-140628150326547-oozie-hado-W@person-action] > org.apache.oozie.command.CommandException: E0800: Action it is not running > its in [PREP] state, action > [0000000-140628150326547-oozie-hado-W@person-action] > at > org.apache.oozie.command.wf.CompletedActionXCommand.eagerVerifyPrecondition(CompletedActionXCommand.java:85) > at org.apache.oozie.command.XCommand.call(XCommand.java:251) > at > org.apache.oozie.service.CallableQueueService$CallableWrapper.run(CallableQueueService.java:175) > at > java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:895) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:918) > at java.lang.Thread.run(Thread.java:662) > 2014-06-28 15:14:46,076 WARN ActionStartXCommand:542 - USER[hadoop] > GROUP[-] TOKEN[] APP[main-job-wf] JOB[0000000-140628150326547-oozie-hado-W] > ACTION[0000000-140628150326547-oozie-hado-W@person-action] > [***0000000-140628150326547-oozie-hado-W@person-action***]Action > status=RUNNING > 2014-06-28 15:14:46,076 WARN ActionStartXCommand:542 - USER[hadoop] > GROUP[-] TOKEN[] APP[main-job-wf] JOB[0000000-140628150326547-oozie-hado-W] > ACTION[0000000-140628150326547-oozie-hado-W@person-action] > [***0000000-140628150326547-oozie-hado-W@person-action***]Action updated > in DB! > 2014-06-28 15:14:46,095 INFO ActionStartXCommand:539 - USER[hadoop] > GROUP[-] TOKEN[] APP[person-enrichment-job-wf] > JOB[0000004-140628150326547-oozie-hado-W] > ACTION[0000004-140628150326547-oozie-hado-W@:start:] Start action > [0000004-140628150326547-oozie-hado-W@:start:] with user-retry state : > userRetryCount [0], userRetryMax [0], userRetryInterval [10] > 2014-06-28 15:14:46,095 WARN ActionStartXCommand:542 - USER[hadoop] > GROUP[-] TOKEN[] APP[person-enrichment-job-wf] > JOB[0000004-140628150326547-oozie-hado-W] > ACTION[0000004-140628150326547-oozie-hado-W@:start:] > [***0000004-140628150326547-oozie-hado-W@:start:***]Action status=DONE > 2014-06-28 15:14:46,095 WARN ActionStartXCommand:542 - USER[hadoop] > GROUP[-] TOKEN[] APP[person-enrichment-job-wf] > JOB[0000004-140628150326547-oozie-hado-W] > ACTION[0000004-140628150326547-oozie-hado-W@:start:] > [***0000004-140628150326547-oozie-hado-W@:start:***]Action updated in DB! > > > > However the pause doesn't seem to happen in the callback. It seems to > have something to do with the actions getting suspended as shown in the > following log entry. > > > > 2014-06-28 15:14:51,436 INFO JavaActionExecutor:539 - USER[hadoop] > GROUP[-] TOKEN[] APP[person-enrichment-job-wf] > JOB[0000004-140628150326547-oozie-hado-W] > ACTION[0000004-140628150326547-oozie-hado-W@personTelephoneEnrichmentJob] > checking action, external ID [job_201406260453_0554] status [RUNNING] > 2014-06-28 15:14:51,439 WARN ActionStartXCommand:542 - USER[hadoop] > GROUP[-] TOKEN[] APP[person-enrichment-job-wf] > JOB[0000004-140628150326547-oozie-hado-W] > ACTION[0000004-140628150326547-oozie-hado-W@personTelephoneEnrichmentJob] > [***0000004-140628150326547-oozie-hado-W@personTelephoneEnrichmentJob***]Action > status=RUNNING > 2014-06-28 15:14:51,440 WARN ActionStartXCommand:542 - USER[hadoop] > GROUP[-] TOKEN[] APP[person-enrichment-job-wf] > JOB[0000004-140628150326547-oozie-hado-W] > ACTION[0000004-140628150326547-oozie-hado-W@personTelephoneEnrichmentJob] > [***0000004-140628150326547-oozie-hado-W@personTelephoneEnrichmentJob***]Action > updated in DB! > 2014-06-28 15:15:42,976 WARN ActionCheckXCommand:542 - USER[hadoop] > GROUP[-] TOKEN[] APP[person-enrichment-job-wf] > JOB[0000011-140515062648271-oozie-hado-W] > ACTION[0000011-140515062648271-oozie-hado-W@personEnrichmentJob] E0818: > Action [0000011-140515062648271-oozie-hado-W@personEnrichmentJob] status > is running but WF Job [0000011-140515062648271-oozie-hado-W] status is > [SUSPENDED]. Expected status is RUNNING., Error Code: E0818 > > > Looking at the Hadoop jobtracker UI these jobs completed successfully > at Sat Jun 28 15:15:52 UTC 2014 which seems to be right around the > suspended messages. These actions stay suspended for another ten minutes > before kicking off the next set of actions. > > > 2014-06-28 15:25:43,603 INFO JavaActionExecutor:539 - USER[hadoop] > GROUP[-] TOKEN[] APP[person-enrichment-job-wf] > JOB[0000004-140628150326547-oozie-hado-W] > ACTION[0000004-140628150326547-oozie-hado-W@personEmailEnrichmentJob] > action completed, external ID [job_201406260453_0553] > 2014-06-28 15:25:43,607 INFO JavaActionExecutor:539 - USER[hadoop] > GROUP[-] TOKEN[] APP[person-enrichment-job-wf] > JOB[0000004-140628150326547-oozie-hado-W] > ACTION[0000004-140628150326547-oozie-hado-W@personEmailEnrichmentJob] > action produced output > > > > Is this still related to the callback issue you mention or is this > something else with the suspended actions? FWIW I am able to connect to > the Oozie servers from all the nodes in the Hadoop cluster. > > Thanks for the help! > > Jonathan > > > > > > On Fri, Jun 27, 2014 at 11:29 AM, Amit Patil <[email protected]> > wrote: > >> I think the callback url is not reachable by all the nodes on your hadoop >> cluster. Since, the callback url fails, there is no way for the workers to >> notify the oozie server that the action has finished execution. After 10 >> mins, the oozie server polls the worker node to get the status. Thats why >> you are seeing 10 mins pauses. >> Set the following variable in oozie-env.sh to the ipaddres of the server >> instead of the hostname and restart the oozie server >> export OOZIE_HTTP_HOSTNAME=<server ip> >> >> >> >> >> On Fri, Jun 27, 2014 at 10:19 AM, Jonathan Hodges <[email protected]> >> wrote: >> >> > Hi, >> > >> > We have noticed some pauses between execution of action steps in our >> Oozie >> > workflows. A job may take a couple minutes yet the next step isn't >> > triggered for sometimes ten minutes. Is there a way to control this >> pause? >> > We have a set of jobs that when executed at a command-line runs for >> about >> > 45 mins. The same set of jobs in Oozie is over 3 hours. >> > >> > We tried setting the following properties, but we aren't using >> coordinators >> > so it didn't help. >> > >> > oozie.service.CoordMaterializeTriggerService.lookup.interval=30 >> > oozie.service.CoordMaterializeTriggerService.materialization.window=7200 >> > >> > Is there maybe similar properties at the workflow level we can set? >> > >> > Thanks so much, >> > >> > Jonathan >> > >> > >
