Alejandro and Mona,

Thank you for the quick response! One thing that I did do over the course of my testing was to kick off all of the jobs and then manually move half of them to a separate pool through the fair scheduler page, to see if it was a pool resource conflict. The initial pool didn't fill back up to the user cap (which would have allowed more jobs to be kicked off in it) after I did that, but I am still happy to at least try bumping the cap to see if I can raise that threshold.
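For concreteness, this is roughly the kind of change I have in mind for the cap (a minimal sketch of the fair scheduler allocations file; the pool name, user name, and numbers are just placeholders, not our actual config):

  <?xml version="1.0"?>
  <allocations>
    <!-- pool the coordinator-driven workflows run in (placeholder name) -->
    <pool name="etl">
      <!-- raise the per-pool cap on concurrently running jobs -->
      <maxRunningJobs>40</maxRunningJobs>
    </pool>
    <!-- raise the per-user cap for the submitting user (placeholder name) -->
    <user name="etl_user">
      <maxRunningJobs>40</maxRunningJobs>
    </user>
    <!-- default per-user cap for users without an explicit entry -->
    <userMaxJobsDefault>20</userMaxJobsDefault>
  </allocations>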
I'll let you know the result after I test that change. I have also pasted, below the quoted thread, a sketch of how I read the dedicated launcher queue suggestion; please correct me if I have it wrong.

--
Matt

On Tue, Oct 9, 2012 at 7:37 PM, Alejandro Abdelnur <[email protected]> wrote:

> It looks like the cluster/queue capacity is being exceeded.
>
> Adding to Mona's answer, you could configure oozie launcher jobs to
> run in their own scheduler queue, thus not competing with regular jobs
> for slots.
>
> Thx
>
> On Tue, Oct 9, 2012 at 5:26 PM, Mona Chitnis <[email protected]> wrote:
> > Matt,
> >
> > When an Oozie job starts, the Launcher, which is a map-only job, occupies a
> > map slot. Now, the jobtracker/resource-manager does a calculation of the
> > maximum allowable M-R slots per user, which is a function of your cluster
> > size. We had come across this scenario once with a small 7-node cluster
> > where Launcher map tasks themselves filled up the resource slot quota and
> > caused a deadlock. I believe this issue looks similar. Can you try bumping
> > up the resource availability in capacity-scheduler.xml?
> > --
> > Mona Chitnis
> >
> > On 10/9/12 5:00 PM, "Matt Goeke" <[email protected]> wrote:
> >
> >>All,
> >>
> >>We have a nightly ETL process that has 80+ workflows associated with it,
> >>all staged through coordinators. As of right now I have to throttle the
> >>start-up of groups of workflows across an 8 hour period, but I would love
> >>to just let them all run at the same time. The issue I run into is that up
> >>to a certain number of running workflows all of the nodes transition
> >>perfectly, but as soon as I cross a threshold of X number of jobs running
> >>(I haven't had the time to figure out the exact number yet, but it is
> >>around 10-15) it is almost as if I hit a resource deadlock within Oozie.
> >>Essentially all of the node transitions freeze, even something as simple
> >>as a conditional block; all of the MR jobs associated with the workflows
> >>either sit in a state of 100% or 0% (in the case of 0%, a job has been
> >>staged but the map task has no log associated with it); and new jobs can
> >>be staged but no transition will occur. It can sit in this state
> >>indefinitely until I kill off some of the workflows, and once I get under
> >>that magic threshold everything starts back up and transitions occur again
> >>as if nothing ever happened.
> >>
> >>In an effort to figure out what is going on, I have put it into this state
> >>and looked at many different things:
> >>1) external resource contention
> >>2) resource contention on the box it is staged on (including DB
> >>connections to the MySQL instance that houses the oozie schema)
> >>3) JMX data from the Oozie server
> >>4) JobTracker/FairScheduler pool properties
> >>5) log output found in /var/log/oozie/
> >>and none of these indicate anything deadlocked or any resources being
> >>capped.
> >>
> >>I am at the point where the next steps would be to do source diving / turn
> >>debug on on the tomcat and try to set remote breakpoints, but I would love
> >>to see if anyone has any ideas on tweaks I can try first. I do know from
> >>the logs that threads are still actively checking whether jobs have
> >>completed (JMX stack traces seem to indicate the same thing), so it would
> >>almost seem as if there is some livelock being hit where job callbacks are
> >>not able to be processed.
> >>
> >>We have a workaround atm, but obviously it is due to a virtual limitation
> >>and not some external resource limitation, so I would love to know if this
> >>can be fixed. Logs, stack traces, oozie-site properties and pretty much
> >>anything else can be provided if need be to help iron out what is going on.
> >>
> >>--
> >>Matt
> >
> >
>
> --
> Alejandro
>
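As promised above, here is how I read the dedicated launcher queue suggestion: a minimal sketch of one action's configuration block, with placeholder pool names, and assuming a fair scheduler version that honors mapred.fairscheduler.pool (on the capacity scheduler I'd expect it to be mapred.job.queue.name instead).

  <action name="etl-step">
    <map-reduce>
      <job-tracker>${jobTracker}</job-tracker>
      <name-node>${nameNode}</name-node>
      <configuration>
        <!-- oozie.launcher.* properties are applied to the launcher job only,
             so just the launcher map task lands in the "launchers" pool -->
        <property>
          <name>oozie.launcher.mapred.fairscheduler.pool</name>
          <value>launchers</value>
        </property>
        <!-- the actual MR job it spawns stays in the regular pool -->
        <property>
          <name>mapred.fairscheduler.pool</name>
          <value>etl</value>
        </property>
      </configuration>
    </map-reduce>
    <ok to="next-step"/>
    <error to="fail"/>
  </action>

If that is the right shape, I could then give the launchers pool its own small slot allocation so the launchers and the jobs they are waiting on never compete for the same slots. Does that match what you had in mind, Alejandro?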
