Matt,

When an Oozie job starts, the Launcher, which is a map-only job, occupies a map slot. The jobtracker/resource-manager then calculates the maximum allowable M-R slots per user, which is a function of your cluster size. We came across this scenario once on a small 7-node cluster, where the Launcher map tasks themselves filled up the resource slot quota and caused a deadlock. This issue looks similar. Can you try bumping up the resource availability in capacity-scheduler.xml?

--
Mona Chitnis
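
For example, a rough sketch of the kind of changes in capacity-scheduler.xml (assuming YARN's CapacityScheduler and the default queue; property names differ on MR1's capacity scheduler, and the values below are illustrative, not recommendations):

  <configuration>
    <property>
      <!-- Share of the cluster this queue may use. -->
      <name>yarn.scheduler.capacity.root.default.capacity</name>
      <value>100</value>
    </property>
    <property>
      <!-- Let a single user grow past the per-user share when the queue has idle capacity. -->
      <name>yarn.scheduler.capacity.root.default.user-limit-factor</name>
      <value>2</value>
    </property>
    <property>
      <!-- Cap the fraction of the cluster that application masters (and hence Oozie
           launchers on YARN) may hold, so the real map/reduce tasks can still be scheduled. -->
      <name>yarn.scheduler.capacity.maximum-am-resource-percent</name>
      <value>0.5</value>
    </property>
  </configuration>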
On 10/9/12 5:00 PM, "Matt Goeke" <[email protected]> wrote:

>All,
>
>We have a nightly ETL process that has 80+ workflows associated with it,
>all staged through coordinators. As of right now I have to throttle the
>start-up of groups of workflows across an 8 hour period, but I would love
>to just let them all run at the same time. The issue I run into is that up
>to a certain number of running workflows all of the nodes transition
>perfectly, but as soon as I cross a threshold of X jobs running (I haven't
>had the time to figure out the exact number yet, but it is around 10-15)
>it is almost as if I hit a resource deadlock within Oozie. Essentially all
>of the node transitions freeze, even something as simple as a conditional
>block; all of the MR jobs associated with the workflows sit at either 100%
>or 0% (in the case of 0%, a job has been staged but the map task has no
>log associated with it); and new jobs can be staged but no transition will
>occur. It can sit in this state indefinitely until I kill off some of the
>workflows, and once I get under that magic threshold everything starts
>back up and transitions occur again as if nothing ever happened.
>
>In an effort to figure out what is going on, I have put it into this state
>and looked at many different things:
>1) external resource contention
>2) resource contention on the box it is staged on (including DB
>connections to the MySQL instance that houses the oozie schema)
>3) JMX data from the Oozie server
>4) JobTracker/FairScheduler pool properties
>5) log output found in /var/log/oozie/
>None of these indicate anything deadlocked or any resources being capped.
>
>I am to the point where the next steps would be to go source diving / turn
>debug on on the Tomcat instance and try to set remote breakpoints, but I
>would love to hear if anyone has ideas on tweaks I can try first. I do
>know from the logs that threads are still actively checking whether jobs
>have completed (JMX stack traces seem to indicate the same thing), so it
>would almost seem as if there is some live-locking mechanism being hit
>where job callbacks are not able to be processed.
>
>We have a workaround atm, but obviously it is because of a virtual
>limitation and not some external resource limitation, so I would love to
>know if this can be fixed. Logs, stack traces, oozie-site properties and
>pretty much anything else can be provided if need be to help iron out what
>is going on.
>
>--
>Matt
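
Also, on the throttling workaround: the staggering can often be expressed in the coordinator itself rather than by hand, via the <controls> block of the coordinator definition. A rough sketch (element names come from the Oozie coordinator schema; the app name, dates, path and limits below are placeholders, not recommendations):

  <coordinator-app name="nightly-etl" frequency="${coord:days(1)}"
                   start="2012-10-01T00:00Z" end="2013-10-01T00:00Z"
                   timezone="UTC" xmlns="uri:oozie:coordinator:0.2">
    <controls>
      <!-- Maximum actions of this coordinator allowed to run concurrently. -->
      <concurrency>5</concurrency>
      <!-- Order in which queued actions are started. -->
      <execution>FIFO</execution>
      <!-- Maximum actions allowed to sit in WAITING state at once. -->
      <throttle>10</throttle>
    </controls>
    <action>
      <workflow>
        <app-path>${workflowAppPath}</app-path>
      </workflow>
    </action>
  </coordinator-app>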
