Hi Chris,
Most of them are know issues. Those are mostly taken care of in recent releases.
What is your Oozie version? I'm sure it is something released more than a year
back (by looking into the log output). I would recommend to use the latest 3.2
Oozie.
Also there is hadoop bug regarding the callback to oozie to report the hadoop
job's status. Most of the case, that is the reason of 10 minutes delay. You can
minimize it by reducing the property value to 5 minutes {say) from 10 minutes
in oozie-site.xml.
<property>
<name>oozie.service.ActionCheckerService.action.check.delay</name>
<value>300</value>
<description>
The time, in seconds, between an ActionCheck for the same action.
</description>
</property>
Regards,
Mohammad
________________________________
From: Chris White <[email protected]>
To: [email protected]
Sent: Thursday, November 15, 2012 4:21 PM
Subject: More Oozie 'deadlocking'
Similar to Andy Kartashov's recent posts on the hadoop user mailing list
('Oozie apparent concurrency deadlocking'), i am experiencing something
similar with my Oozie instance (CDH3u1). Recently the oozie server has
become very slow to respond (2+ mins) to job listings, job info requests
etc. Periods between actions completing and the next action executing has
also increased to 10s of minutes too (but requests for instrumentation /
config / queuedumps respond immediately).
Some background about my current state: I have around 20 coordinator jobs
all running, most of which are scheduled to run at 00:30, and operate on
similar input patterns (all data for the previous day). We had a data flow
outage recently and the input data (+parent folders) that the coord
instances depend on were not present for some 10 days, causing most of the
jobs to timeout (timeout for all jobs is 1 hour).
Examining the logs i see a number of things:
For months, my logs have been littered with exceptions similar to this:
CallableQueueService$CompositeCallable:528 - USER[-] GROUP[-] TOKEN APP JOB
ACTION exception callable [coord_action_ready], E0607: Other error in
operation [java.lang.NullPointerException], {1}
org.apache.oozie.command.CommandException: E0607: Other error in operation
[java.lang.NullPointerException], {1}
at o.a.o.command.Command.call(Command.java:273)
at
o.a.o.s.CallableQueueService$CompositeCallable.call(CallableQueueService.java:211)
at
o.a.o.s.CallableQueueService$CallableWrapper.run(CallableQueueService.java:128)
Caused by: java.lang.NullPointerException
at o.a.o.command.Command.setLogInfo(Command.java:427)
at
o.a.o.c.c.CoordActionReadyCommand.call(CoordActionReadyCommand.java:107)
at o.a.o.c.c.CoordActionReadyCommand.call(CoordActionReadyCommand.java:27)
at o.a.o.c.Command.call(Command.java:202)
More recently, i see endless listings of the following log entry (20/30 a
second):
WARN CallableQueueService:528 max concurrent for callable
[coord_action_input] exceeded, requeing with [500]ms delay
If i perform a queuedump (oozie admin -queuedump), i see ~150
coord_action_input entries with delays between -1 and 500, occasionally
into the 10,000's. Querying this service every 5 minutes, i appear to be
gaining a few new entries every couple of minutes (i've set up a script to
log the count every 5 mins to see what happens overnight)
I have tried to retrospectively fix up the missing paths in HDFS, hoping
that might resolve the queue, and i've also gone in and forcibly set the
status of all the coordinator jobs in the COORD_JOBS mysql table to
SUSPENDED. The server comes and goes, everything works for a while after a
restart but then it gets back into the 'non-responsive' state for anything
job related.
Short of truncating my COORD_* tables and starting from scratch, has anyone
got any insight as to what might be happening? Or what i can do to diagnose
further
Thanks
Chris