Fixed this; it was a matter of adding the line 'supersede domain-name "real.domain.com";' to /etc/dhcp/dhclient.conf, which overrides the wrong domain name that DHCP was writing into /etc/resolv.conf. Not a Hadoop/Oozie issue at all, just a newbie Ubuntu problem.
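For anyone hitting the same symptom, here is a minimal sketch for checking what the machine itself thinks its domain is. It is not Hadoop's or Oozie's own lookup code (those run on the JVM); it only shows which domains /etc/resolv.conf declares and what the system resolver returns for the local host, so the two can be compared. The function names resolv_conf_domains and local_names are made up for this example.

```python
import socket


def resolv_conf_domains(text):
    """Return the names listed on 'domain' and 'search' lines of resolv.conf text."""
    domains = []
    for raw in text.splitlines():
        line = raw.strip()
        # Skip blank lines and comment lines (resolv.conf uses '#' or ';').
        if not line or line[0] in "#;":
            continue
        fields = line.split()
        if fields[0] in ("domain", "search"):
            domains.extend(fields[1:])
    return domains


def local_names():
    """Return (short host name, fully qualified name) as the system resolver sees them."""
    short = socket.gethostname()
    return short, socket.getfqdn(short)


if __name__ == "__main__":
    try:
        with open("/etc/resolv.conf") as fh:
            print("resolv.conf domains:", resolv_conf_domains(fh.read()))
    except OSError as exc:
        print("could not read /etc/resolv.conf:", exc)
    short, fqdn = local_names()
    print("short name:", short)
    print("resolved FQDN:", fqdn)
```

If the resolved FQDN ends in a domain you do not expect, or resolv.conf lists a wrong domain, the DHCP-supplied settings are a likely suspect.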
Now it's working and I get the task-finished callbacks without any issues. Thanks!

________________________________________
From: Jess Sheneberger
Sent: Wednesday, October 17, 2012 2:31 PM
To: [email protected]; Mohammad Islam
Subject: RE: callbacks not happening, short jobs in RUNNING state for 10 min

Thanks Mohammad,

I found the exception in the tracker log. It was failing to resolve the domain name because I had removed it from the hosts file while trying to troubleshoot another issue.

This may be way out of scope for you and/or this list, but where do Oozie/Hadoop get the full domain name of the local machine? My /etc/hostname specifies only the short name and my network's DNS is wrong. Is there a place I can override this?

________________________________________
From: Mohammad Islam [[email protected]]
Sent: Wednesday, October 17, 2012 12:28 PM
To: [email protected]
Subject: Re: callbacks not happening, short jobs in RUNNING state for 10 min

Hi Jess,

Your analysis is correct. If you never received any callback for job id 0000014-121016184312009-oozie-oozi-W, most likely Hadoop has some issue.

What version of Hadoop are you using? Is it secured Hadoop?

Can you please check the job tracker log around that time frame? Some relevant messages might be there.

One bad callback could slow down all the callbacks from the JT. You might even receive your callback after a few hours due to late delivery from the JT. The JT currently uses a single thread for dispatching all the callbacks.

Regards,
Mohammad

________________________________
From: Jess Sheneberger <[email protected]>
To: "[email protected]" <[email protected]>
Sent: Wednesday, October 17, 2012 11:03 AM
Subject: callbacks not happening, short jobs in RUNNING state for 10 min

Hi,

I'm trying out Oozie for the first time. When I first started running the examples they'd complete fairly quickly, but now they're taking a long time (10+ minutes) to complete.

I think the callbacks aren't working, because in the first few runs I can see a job log entry for CallbackServlet, but I don't see this on my most recent job runs. It looks like the action (shell, java, etc.) from the example that reads arguments and writes back to stdout is running quickly, and then Oozie leaves the job in RUNNING state for 10 minutes until it polls it. Any idea what could be messing up the callback?

In the first few runs I saw this in the job log, just a few seconds after the job transitioned to RUNNING:

2012-10-16 19:35:45,063 INFO CallbackServlet:539 - USER[-] GROUP[-] TOKEN[-] APP[-] JOB[0000002-121016184312009-oozie-oozi-W] ACTION[0000002-121016184312009-oozie-oozi-W@shell1] callback for action [0000002-121016184312009-oozie-oozi-W@shell1]

Now, after the job transitions to RUNNING, I see this about 10 minutes later:

2012-10-17 11:52:29,640 INFO JavaActionExecutor:539 - USER[jess] GROUP[-] TOKEN[] APP[java-main-wf] JOB[0000014-121016184312009-oozie-oozi-W] ACTION[0000014-121016184312009-oozie-oozi-W@java-node] action completed, external ID [job_201210161828_0013]

That must be the poller kicking in and realizing the task has completed, but why isn't the callback happening? How can I troubleshoot this?

Thanks
Jess
