Hi Steven,

Regarding the warning log msg WARN CallableQueueService:542 - USER[-]
GROUP[-] max concurrency for callable [action.check] exceeded, requeueing
with [500]ms delay
There is a default limit = 3 placed on max concurrency of 'same type' of
commands e.g. The types can be input-check, start, stop, kill etc.. This
is to make sure that the Oozie command queue does not become full by just
one type of commands thus, potentially starving the rest. You can set a
custom value for this parameter as follows if you wish to alleviate this
limit.

<property>
        
<name>oozie.service.CallableQueueService.callable.concurrency</name>
        <value>custom_value</value>
    </property>


Now, regarding your first concern about the job end notification timing
out, I am actually more concerned about this -
Unknown hadoop job [job_1352323593151_0067] associated with action
[0000009-121108121118092-oozie-oozi-W@users]


Has the Oozie server been restarted after submitting this particular job,
causing it to lose context of hadoop job-id?
--
Mona Chitnis




On 11/9/12 9:18 AM, "Steven Willis" <[email protected]> wrote:

>Hi,
>
>I was recently load testing our cluster after enabling Kerberos security:
>we're running yarn on 14 worker nodes, 2 ha-namenodes, with oozie running
>on one of the namenodes. I kicked off 25 oozie workflows that each
>contain 3 mapreduce jobs. The main MR job in each workflow, called
>pageviews, has about 500 - 800 map tasks and about 30 reduce tasks. The
>workflow contains two more MR jobs that depend on pageviews: users and
>sessions, they each have 30 map tasks and 1 reduce task they can run in
>parallel after pageviews finishes. All of the hadoop jobs for all the
>workflows completed successfully. However oozie shows that one of the
>workflows failed. The output of the oozie job log contains:
>
>2012-11-09 10:23:43,646  INFO CallbackServlet:539 - USER[-] GROUP[-]
>TOKEN[-] APP[-] JOB[0000009-121108121118092-oozie-oozi-W]
>ACTION[0000009-121108121118092-oozie-oozi-W@users] callback for action
>[0000009-121108121118092-oozie-oozi-W@users]
>2012-11-09 10:23:47,409  INFO CallbackServlet:539 - USER[-] GROUP[-]
>TOKEN[-] APP[-] JOB[0000009-121108121118092-oozie-oozi-W]
>ACTION[0000009-121108121118092-oozie-oozi-W@users] callback for action
>[0000009-121108121118092-oozie-oozi-W@users]
>2012-11-09 10:23:52,420  INFO CallbackServlet:539 - USER[-] GROUP[-]
>TOKEN[-] APP[-] JOB[0000009-121108121118092-oozie-oozi-W]
>ACTION[0000009-121108121118092-oozie-oozi-W@users] callback for action
>[0000009-121108121118092-oozie-oozi-W@users]
>2012-11-09 10:23:57,428  INFO CallbackServlet:539 - USER[-] GROUP[-]
>TOKEN[-] APP[-] JOB[0000009-121108121118092-oozie-oozi-W]
>ACTION[0000009-121108121118092-oozie-oozi-W@users] callback for action
>[0000009-121108121118092-oozie-oozi-W@users]
>2012-11-09 10:24:02,436  INFO CallbackServlet:539 - USER[-] GROUP[-]
>TOKEN[-] APP[-] JOB[0000009-121108121118092-oozie-oozi-W]
>ACTION[0000009-121108121118092-oozie-oozi-W@users] callback for action
>[0000009-121108121118092-oozie-oozi-W@users]
>2012-11-09 10:24:37,000  WARN MapReduceActionExecutor:542 - USER[mkeeney]
>GROUP[-] TOKEN[] APP[hpdp] JOB[0000009-121108121118092-oozie-oozi-W]
>ACTION[0000009-121108121118092-oozie-oozi-W@users] Exception in check().
>Message[JA017: Unknown hadoop job [job_1352323593151_0067] associated
>with action [0000009-121108121118092-oozie-oozi-W@users].  Failing this
>action!]
>org.apache.oozie.action.ActionExecutorException: JA017: Unknown hadoop
>job [job_1352323593151_0067] associated with action
>[0000009-121108121118092-oozie-oozi-W@users].  Failing this action!
>       at 
>org.apache.oozie.action.hadoop.JavaActionExecutor.check(JavaActionExecutor
>.java:891)
>       at 
>org.apache.oozie.command.wf.ActionCheckXCommand.execute(ActionCheckXComman
>d.java:162)
>       at 
>org.apache.oozie.command.wf.ActionCheckXCommand.execute(ActionCheckXComman
>d.java:51)
>       at org.apache.oozie.command.XCommand.call(XCommand.java:260)
>       at 
>org.apache.oozie.service.CallableQueueService$CallableWrapper.run(Callable
>QueueService.java:166)
>       at 
>java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:
>1110)
>       at 
>java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java
>:603)
>       at java.lang.Thread.run(Thread.java:679)
>
>And the syslog from the hadoop job in question contains:
>
>2012-11-09 10:23:35,491 INFO [AsyncDispatcher event handler]
>org.apache.hadoop.mapreduce.v2.app.MRAppMaster: Job end notification
>started for jobID : job_1352323593151_0067
>2012-11-09 10:23:35,497 INFO [AsyncDispatcher event handler]
>org.mortbay.log: Job end notification attempts left 4
>2012-11-09 10:23:35,497 INFO [AsyncDispatcher event handler]
>org.mortbay.log: Job end notification trying
>http://NAMENODE1:11000/oozie/callback?id=0000009-121108121118092-oozie-ooz
>i-W@users&status=SUCCEEDED&
>[... snip other cleanup logs ...]
>2012-11-09 10:23:42,395 WARN [AsyncDispatcher event handler]
>org.mortbay.log: Job end notification to
>http://NAMENODE1:11000/oozie/callback?id=0000009-121108121118092-oozie-ooz
>i-W@users&status=SUCCEEDED& failed
>java.net.SocketTimeoutException: Read timed out
>       at java.net.SocketInputStream.socketRead0(Native Method)
>       at java.net.SocketInputStream.read(SocketInputStream.java:146)
>       at java.io.BufferedInputStream.fill(BufferedInputStream.java:235)
>       at java.io.BufferedInputStream.read1(BufferedInputStream.java:275)
>       at java.io.BufferedInputStream.read(BufferedInputStream.java:334)
>       at sun.net.www.http.HttpClient.parseHTTPHeader(HttpClient.java:688)
>       at sun.net.www.http.HttpClient.parseHTTP(HttpClient.java:633)
>       at 
>sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnecti
>on.java:1162)
>       at 
> java.net.HttpURLConnection.getResponseCode(HttpURLConnection.java:397)
>       at 
>org.apache.hadoop.mapreduce.v2.app.JobEndNotifier.notifyURLOnce(JobEndNoti
>fier.java:127)
>       at 
>org.apache.hadoop.mapreduce.v2.app.JobEndNotifier.notify(JobEndNotifier.ja
>va:176)
>       at 
>org.apache.hadoop.mapreduce.v2.app.MRAppMaster$JobFinishEventHandler.handl
>e(MRAppMaster.java:405)
>       at 
>org.apache.hadoop.mapreduce.v2.app.MRAppMaster$JobFinishEventHandler.handl
>e(MRAppMaster.java:391)
>       at 
>org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java
>:125)
>       at 
>org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:74
>)
>       at java.lang.Thread.run(Thread.java:679)
>2012-11-09 10:23:42,399 INFO [AsyncDispatcher event handler]
>org.mortbay.log: Job end notification attempts left 3
>2012-11-09 10:23:42,399 INFO [AsyncDispatcher event handler]
>org.mortbay.log: Job end notification trying
>http://NAMENODE1:11000/oozie/callback?id=0000009-121108121118092-oozie-ooz
>i-W@users&status=SUCCEEDED&
>[... snip other cleanup logs ...]
>2012-11-09 10:23:47,406 WARN [AsyncDispatcher event handler]
>org.mortbay.log: Job end notification to
>http://NAMENODE1:11000/oozie/callback?id=0000009-121108121118092-oozie-ooz
>i-W@users&status=SUCCEEDED& failed
>[... snip stack trace ...]
>2012-11-09 10:23:47,409 INFO [AsyncDispatcher event handler]
>org.mortbay.log: Job end notification attempts left 2
>2012-11-09 10:23:47,409 INFO [AsyncDispatcher event handler]
>org.mortbay.log: Job end notification trying
>http://NAMENODE1:11000/oozie/callback?id=0000009-121108121118092-oozie-ooz
>i-W@users&status=SUCCEEDED&
>[... snip other cleanup logs ...]
>2012-11-09 10:23:52,416 WARN [AsyncDispatcher event handler]
>org.mortbay.log: Job end notification to
>http://NAMENODE1:11000/oozie/callback?id=0000009-121108121118092-oozie-ooz
>i-W@users&status=SUCCEEDED& failed
>[... snip stack trace ...]
>2012-11-09 10:23:52,418 INFO [AsyncDispatcher event handler]
>org.mortbay.log: Job end notification attempts left 1
>2012-11-09 10:23:52,418 INFO [AsyncDispatcher event handler]
>org.mortbay.log: Job end notification trying
>http://NAMENODE1:11000/oozie/callback?id=0000009-121108121118092-oozie-ooz
>i-W@users&status=SUCCEEDED&
>[... snip other cleanup logs ...]
>2012-11-09 10:23:57,425 WARN [AsyncDispatcher event handler]
>org.mortbay.log: Job end notification to
>http://NAMENODE1:11000/oozie/callback?id=0000009-121108121118092-oozie-ooz
>i-W@users&status=SUCCEEDED& failed
>[... snip stack trace ...]
>2012-11-09 10:23:57,427 INFO [AsyncDispatcher event handler]
>org.mortbay.log: Job end notification attempts left 0
>2012-11-09 10:23:57,428 INFO [AsyncDispatcher event handler]
>org.mortbay.log: Job end notification trying
>http://NAMENODE1:11000/oozie/callback?id=0000009-121108121118092-oozie-ooz
>i-W@users&status=SUCCEEDED&
>[... snip other cleanup logs ...]
>2012-11-09 10:24:02,435 WARN [AsyncDispatcher event handler]
>org.mortbay.log: Job end notification to
>http://NAMENODE1:11000/oozie/callback?id=0000009-121108121118092-oozie-ooz
>i-W@users&status=SUCCEEDED& failed
>[... snip stack trace ...]
>2012-11-09 10:24:02,437 WARN [AsyncDispatcher event handler]
>org.mortbay.log: Job end notification failed to notify :
>http://NAMENODE1:11000/oozie/callback?id=0000009-121108121118092-oozie-ooz
>i-W@users&status=SUCCEEDED&
>
>The timestamps on the timeouts in the hadoop job log seem to line up
>exactly with the "INFO CallbackServlet:539" in the oozie logs. I did
>notice that the oozie web interface and the various hadoop web interfaces
>seemed sluggish during this time. We've run more jobs than this before
>and never had this problem. I also noticed this line in the oozie.log:
>
>2012-11-09 10:24:17,523  WARN CallableQueueService:542 - USER[-] GROUP[-]
>max concurrency for callable [action.check] exceeded, requeueing with
>[500]ms delay
>
>We're running CDH4, with oozie at 3.1.3-cdh4.0.0 and hadoop at
>2.0.0-cdh4.0.0
>
>Any ideas?
>
>-Steve

Reply via email to