Mohammed,

CDH3u1, which i think relates to something based around 2.3.0 (Jul 2011). I
tried to run CDH3u4 version of the oozie tarball, but that has some
incompatibilities with the hadoop core (0.20.2, again CDH3u1). Are there
any JIRA ticket numbers that you can think of so i can try and manually
patch (assuming they are 'simple' fixes).

Thanks

Chris


On Thu, Nov 15, 2012 at 8:20 PM, Mohammad Islam <[email protected]> wrote:

> Hi Chris,
>
> Most of them are know issues. Those are mostly taken care of in recent
> releases.
>
> What is your Oozie version? I'm sure it is something released more than a
> year back (by looking into the log output). I would recommend to use the
> latest 3.2 Oozie.
>
> Also there is hadoop bug regarding the callback to oozie to report the
> hadoop job's status. Most of the case, that is the reason of 10 minutes
> delay. You can minimize it by reducing the property value to 5 minutes
> {say) from 10 minutes in oozie-site.xml.
>
>
> <property>
>         <name>oozie.service.ActionCheckerService.action.check.delay</name>
>         <value>300</value>
>         <description>
>             The time, in seconds, between an ActionCheck for the same
> action.
>         </description>
>     </property>
>
> Regards,
> Mohammad
>
>
> ________________________________
>  From: Chris White <[email protected]>
> To: [email protected]
> Sent: Thursday, November 15, 2012 4:21 PM
> Subject: More Oozie 'deadlocking'
>
> Similar to Andy Kartashov's recent posts on the hadoop user mailing list
> ('Oozie apparent concurrency deadlocking'), i am experiencing something
> similar with my Oozie instance (CDH3u1). Recently the oozie server has
> become very slow to respond (2+ mins) to job listings, job info requests
> etc. Periods between actions completing and the next action executing has
> also increased to 10s of minutes too (but requests for instrumentation /
> config / queuedumps respond immediately).
>
> Some background about my current state: I have around 20 coordinator jobs
> all running, most of which are scheduled to run at 00:30, and operate on
> similar input patterns (all data for the previous day). We had a data flow
> outage recently and the input data (+parent folders) that the coord
> instances depend on were not present for some 10 days, causing most of the
> jobs to timeout (timeout for all jobs is 1 hour).
>
> Examining the logs i see a number of things:
>
> For months, my logs have been littered with exceptions similar to this:
>
> CallableQueueService$CompositeCallable:528 - USER[-] GROUP[-] TOKEN APP JOB
> ACTION exception callable [coord_action_ready], E0607: Other error in
> operation [java.lang.NullPointerException], {1}
> org.apache.oozie.command.CommandException: E0607: Other error in operation
> [java.lang.NullPointerException], {1}
>   at o.a.o.command.Command.call(Command.java:273)
>   at
>
> o.a.o.s.CallableQueueService$CompositeCallable.call(CallableQueueService.java:211)
>   at
>
> o.a.o.s.CallableQueueService$CallableWrapper.run(CallableQueueService.java:128)
> Caused by: java.lang.NullPointerException
>   at o.a.o.command.Command.setLogInfo(Command.java:427)
>   at
> o.a.o.c.c.CoordActionReadyCommand.call(CoordActionReadyCommand.java:107)
>   at
> o.a.o.c.c.CoordActionReadyCommand.call(CoordActionReadyCommand.java:27)
>   at o.a.o.c.Command.call(Command.java:202)
>
> More recently, i see endless listings of the following log entry (20/30 a
> second):
>
> WARN CallableQueueService:528 max concurrent for callable
> [coord_action_input] exceeded, requeing with [500]ms delay
>
> If i perform a queuedump (oozie admin -queuedump), i see ~150
> coord_action_input entries with delays between -1 and 500, occasionally
> into the 10,000's. Querying this service every 5 minutes, i appear to be
> gaining a few new entries every couple of minutes (i've set up a script to
> log the count every 5 mins to see what happens overnight)
>
> I have tried to retrospectively fix up the missing paths in HDFS, hoping
> that might resolve the queue, and i've also gone in and forcibly set the
> status of all the coordinator jobs in the COORD_JOBS mysql table to
> SUSPENDED. The server comes and goes, everything works for a while after a
> restart but then it gets back into the 'non-responsive' state for anything
> job related.
>
> Short of truncating my COORD_* tables and starting from scratch, has anyone
> got any insight as to what might be happening? Or what i can do to diagnose
> further
>
> Thanks
>
> Chris
>

Reply via email to