It looks like the cluster/queue capacity is being exceeded. Adding to Mona's answer, you could configure the Oozie launcher jobs to run in their own scheduler queue, so they are not competing with the regular jobs for slots.
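A rough sketch of what that could look like, inside an action's <configuration> block in the workflow.xml -- the queue names "launchers" and "default" are just placeholders here, and both queues have to exist in the scheduler configuration first:

    <configuration>
        <!-- oozie.launcher.* properties are applied only to the map-only
             launcher job, not to the job the action actually runs. -->
        <property>
            <name>oozie.launcher.mapred.job.queue.name</name>
            <value>launchers</value>
        </property>
        <!-- the real MR job keeps being submitted to its normal queue -->
        <property>
            <name>mapred.job.queue.name</name>
            <value>default</value>
        </property>
    </configuration>

With the launchers confined to a small dedicated queue, the worst case is that launchers wait behind each other rather than holding the map slots that the actual ETL jobs need.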
Thx

On Tue, Oct 9, 2012 at 5:26 PM, Mona Chitnis <[email protected]> wrote:

> Matt,
>
> When an Oozie job starts, the Launcher, which is a map-only job, occupies a
> map slot. Now, the jobtracker/resource-manager does a calculation of the
> maximum allowable M-R slots per user, which is a function of your cluster
> size. We had come across this scenario once with a small 7-node cluster
> where Launcher map-tasks themselves filled up the resource slot quota and
> caused deadlock. I believe this issue looks similar. Can you try bumping
> up the resource availability in the Capacity-scheduler.xml?
> --
> Mona Chitnis
>
>
> On 10/9/12 5:00 PM, "Matt Goeke" <[email protected]> wrote:
>
>> All,
>>
>> We have a nightly ETL process that has 80+ workflows associated with it,
>> all staged through coordinators. As of right now I have to throttle the
>> start-up of groups of workflows across an 8-hour period, but I would love
>> to just let them all run at the same time. The issue I run into is that up
>> to a certain number of workflows running, all of the nodes transition
>> perfectly, but as soon as I cross a threshold of X number of jobs running
>> (I haven't had the time to figure out the exact number yet, but it is
>> around 10-15) it is almost as if I hit a resource deadlock within Oozie.
>> Essentially all of the node transitions freeze, even something as simple
>> as a conditional block; all of the MR jobs associated with the workflows
>> either sit in a state of 100% or 0% (in the case of 0%, a job has been
>> staged but the map task has no log associated with it); and new jobs can
>> be staged but no transition will occur. It can sit in this state
>> indefinitely until I kill off some of the workflows, and once I get under
>> that magic threshold everything starts back up and transitions occur again
>> as if nothing ever happened.
>>
>> In an effort to figure out what is going on, I have put it into this state
>> and looked at many different things:
>> 1) external resource contention
>> 2) resource contention on the box it is staged on (including DB
>> connections to the MySQL instance that houses the Oozie schema)
>> 3) JMX data from the Oozie server
>> 4) JobTracker/FairScheduler pool properties
>> 5) log output found in /var/log/oozie/
>> and none of these indicate anything deadlocked or any resources being
>> capped.
>>
>> I am at the point where the next steps would be to do source diving / turn
>> debug on on the Tomcat and try to set remote breakpoints, but I would love
>> to see if anyone has any ideas on tweaks I can try first. I do know that,
>> from the logs, threads appear to still be actively checking whether jobs
>> have completed (JMX stack traces seem to indicate the same thing), so it
>> would almost seem as if there is some live-lock being hit where job
>> callbacks are not able to be processed.
>>
>> We have a workaround atm, but obviously it is because of a virtual
>> limitation and not some external resource limitation, so I would love to
>> know if this can be fixed. Logs, stack traces, oozie-site properties and
>> pretty much anything else can be provided if need be to help iron out what
>> is going on.
>>
>> --
>> Matt

--
Alejandro
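For illustration, a minimal sketch of the kind of change Mona's capacity-scheduler.xml suggestion points at -- giving the launcher queue its own guaranteed share and relaxing the per-user limit. The queue name "launchers" and the percentages are placeholders (queue capacities must sum to 100, and the queue must also be listed in mapred.queue.names in mapred-site.xml); if you stay on the FairScheduler, you would tune the equivalent pool settings in its allocation file instead:

    <!-- capacity-scheduler.xml (Hadoop 1.x capacity scheduler) -->
    <configuration>

        <!-- Guarantee a small slice of the cluster's slots to the launcher
             queue so launcher map tasks cannot starve out the real jobs. -->
        <property>
            <name>mapred.capacity-scheduler.queue.launchers.capacity</name>
            <value>10</value>
        </property>

        <!-- Allow a single user (e.g. the ETL user) to use the whole
             launcher queue instead of being capped at the default
             per-user share. -->
        <property>
            <name>mapred.capacity-scheduler.queue.launchers.minimum-user-limit-percent</name>
            <value>100</value>
        </property>

        <!-- The remaining capacity stays with the default queue for the
             actual MR jobs. -->
        <property>
            <name>mapred.capacity-scheduler.queue.default.capacity</name>
            <value>90</value>
        </property>

    </configuration>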
