Alejandro and Mona,

Thank you for the quick response! One thing that I did do over the course of my testing was to kick off all of the jobs and then manually move half of them to a separate pool through the fair scheduler page, to see if it was a pool resource conflict. The initial pool didn't fill back up to the user cap (which would have allowed more jobs to be kicked off in it) after I did that, but I am still happy to at least try bumping the cap to see if I can raise that threshold.
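For concreteness, this is roughly the kind of change I have in mind for the cap (a minimal sketch of the fair scheduler allocations file; the pool name, user name, and numbers are just placeholders, not our actual config):

  <?xml version="1.0"?>
  <allocations>
    <!-- pool the coordinator-driven workflows run in (placeholder name) -->
    <pool name="etl">
      <!-- raise the per-pool cap on concurrently running jobs -->
      <maxRunningJobs>40</maxRunningJobs>
    </pool>
    <!-- raise the per-user cap for the submitting user (placeholder name) -->
    <user name="etl_user">
      <maxRunningJobs>40</maxRunningJobs>
    </user>
    <!-- default per-user cap for users without an explicit entry -->
    <userMaxJobsDefault>20</userMaxJobsDefault>
  </allocations>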
I'll let you know the result after I test that change. I have also pasted, below the quoted thread, a sketch of how I read the dedicated launcher queue suggestion; please correct me if I have it wrong.

--
Matt

On Tue, Oct 9, 2012 at 7:37 PM, Alejandro Abdelnur <[email protected]> wrote:

> It looks like the cluster/queue capacity is being exceeded.
>
> Adding to Mona's answer, you could configure oozie launcher jobs to
> run in their own scheduler queue, thus not competing with regular jobs
> for slots.
>
> Thx
>
> On Tue, Oct 9, 2012 at 5:26 PM, Mona Chitnis <[email protected]> wrote:
> > Matt,
> >
> > When an Oozie job starts, the Launcher, which is a map-only job, occupies a
> > map slot. Now, the jobtracker/resource-manager does a calculation of the
> > maximum allowable M-R slots per user, which is a function of your cluster
> > size. We had come across this scenario once with a small 7-node cluster
> > where Launcher map tasks themselves filled up the resource slot quota and
> > caused a deadlock. I believe this issue looks similar. Can you try bumping
> > up the resource availability in capacity-scheduler.xml?
> > --
> > Mona Chitnis
> >
> > On 10/9/12 5:00 PM, "Matt Goeke" <[email protected]> wrote:
> >
> >>All,
> >>
> >>We have a nightly ETL process that has 80+ workflows associated with it,
> >>all staged through coordinators. As of right now I have to throttle the
> >>start-up of groups of workflows across an 8 hour period, but I would love
> >>to just let them all run at the same time. The issue I run into is that up
> >>to a certain number of running workflows all of the nodes transition
> >>perfectly, but as soon as I cross a threshold of X number of jobs running
> >>(I haven't had the time to figure out the exact number yet, but it is
> >>around 10-15) it is almost as if I hit a resource deadlock within Oozie.
> >>Essentially all of the node transitions freeze, even something as simple
> >>as a conditional block; all of the MR jobs associated with the workflows
> >>either sit in a state of 100% or 0% (in the case of 0%, a job has been
> >>staged but the map task has no log associated with it); and new jobs can
> >>be staged but no transition will occur. It can sit in this state
> >>indefinitely until I kill off some of the workflows, and once I get under
> >>that magic threshold everything starts back up and transitions occur again
> >>as if nothing ever happened.
> >>
> >>In an effort to figure out what is going on, I have put it into this state
> >>and looked at many different things:
> >>1) external resource contention
> >>2) resource contention on the box it is staged on (including DB
> >>connections to the MySQL instance that houses the oozie schema)
> >>3) JMX data from the Oozie server
> >>4) JobTracker/FairScheduler pool properties
> >>5) log output found in /var/log/oozie/
> >>and none of these indicate anything deadlocked or any resources being
> >>capped.
> >>
> >>I am at the point where the next steps would be to do source diving / turn
> >>debug on on the tomcat and try to set remote breakpoints, but I would love
> >>to see if anyone has any ideas on tweaks I can try first. I do know from
> >>the logs that threads are still actively checking whether jobs have
> >>completed (JMX stack traces seem to indicate the same thing), so it would
> >>almost seem as if there is some livelock being hit where job callbacks are
> >>not able to be processed.
> >>
> >>We have a workaround atm, but obviously it is due to a virtual limitation
> >>and not some external resource limitation, so I would love to know if this
> >>can be fixed. Logs, stack traces, oozie-site properties and pretty much
> >>anything else can be provided if need be to help iron out what is going on.
> >>
> >>--
> >>Matt
> >
> >
>
> --
> Alejandro
>
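As promised above, here is how I read the dedicated launcher queue suggestion: a minimal sketch of one action's configuration block, with placeholder pool names, and assuming a fair scheduler version that honors mapred.fairscheduler.pool (on the capacity scheduler I'd expect it to be mapred.job.queue.name instead).

  <action name="etl-step">
    <map-reduce>
      <job-tracker>${jobTracker}</job-tracker>
      <name-node>${nameNode}</name-node>
      <configuration>
        <!-- oozie.launcher.* properties are applied to the launcher job only,
             so just the launcher map task lands in the "launchers" pool -->
        <property>
          <name>oozie.launcher.mapred.fairscheduler.pool</name>
          <value>launchers</value>
        </property>
        <!-- the actual MR job it spawns stays in the regular pool -->
        <property>
          <name>mapred.fairscheduler.pool</name>
          <value>etl</value>
        </property>
      </configuration>
    </map-reduce>
    <ok to="next-step"/>
    <error to="fail"/>
  </action>

If that is the right shape, I could then give the launchers pool its own small slot allocation so the launchers and the jobs they are waiting on never compete for the same slots. Does that match what you had in mind, Alejandro?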
