Matt,

When an Oozie job starts, the Launcher, which is a map-only job, occupies
a map slot. The JobTracker/ResourceManager then calculates the maximum
allowable MapReduce slots per user, which is a function of your cluster
size. We came across this scenario once with a small 7-node cluster where
the Launcher map tasks themselves filled up the slot quota and caused a
deadlock. I believe this issue is similar. Can you try bumping up the
resource availability in capacity-scheduler.xml?
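
For reference, something along these lines in capacity-scheduler.xml (a
rough sketch only, assuming MR1 with the CapacityScheduler and a single
"default" queue; adjust the queue name and percentages to your cluster,
and if you are running the FairScheduler the analogous limits live in
its allocations file instead):

  <!-- Share of cluster slots guaranteed to the queue. -->
  <property>
    <name>mapred.capacity-scheduler.queue.default.capacity</name>
    <value>100</value>
  </property>

  <!-- Hard ceiling, as a percent of the cluster, the queue may grow to. -->
  <property>
    <name>mapred.capacity-scheduler.queue.default.maximum-capacity</name>
    <value>100</value>
  </property>

  <!-- Minimum slot share per user as a percentage of queue capacity;
       at 100 a single user (e.g. the ETL user) may use the entire
       queue before being capped. -->
  <property>
    <name>mapred.capacity-scheduler.queue.default.minimum-user-limit-percent</name>
    <value>100</value>
  </property>

Another way out of this particular deadlock is to route the Launcher
jobs into a dedicated queue, e.g. by setting
oozie.launcher.mapred.job.queue.name=launcherqueue in the action
configuration (the queue name here is just an example), so the Launchers
can never starve the child jobs they spawn.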
--
Mona Chitnis




On 10/9/12 5:00 PM, "Matt Goeke" <[email protected]> wrote:

>All,
>
>We have a nightly ETL process with 80+ workflows associated with it, all
>staged through coordinators. Right now I have to throttle the start-up
>of groups of workflows across an 8-hour period, but I would love to just
>let them all run at the same time. The issue I run into is that up to a
>certain number of running workflows all of the nodes transition
>perfectly, but as soon as I cross a threshold of X running jobs (I
>haven't had time to pin down the exact number yet, but it is around
>10-15) it is almost as if I hit a resource deadlock within Oozie.
>Essentially, all of the node transitions freeze, even something as
>simple as a conditional block; all of the MR jobs associated with the
>workflows sit at either 100% or 0% (in the 0% case a job has been staged
>but the map task has no log associated with it); and new jobs can be
>staged but no transitions occur. It can sit in this state indefinitely
>until I kill off some of the workflows, and once I get under that magic
>threshold everything starts back up and transitions occur again as if
>nothing ever happened.
>
>In an effort to figure out what is going on, I have put the system into
>this state and looked at many different things:
>1) external resource contention
>2) resource contention on the box Oozie is staged on (including DB
>connections to the MySQL instance that houses the Oozie schema)
>3) JMX data from the Oozie server
>4) JobTracker/FairScheduler pool properties
>5) log output found in /var/log/oozie/
>None of these indicate anything deadlocked or any resources being
>capped.
>
>I am at the point where the next steps would be source diving / turning
>on debug logging in Tomcat and trying to set remote breakpoints, but I
>would love to see if anyone has ideas on tweaks I can try first. I do
>know from the logs that threads still seem to be actively checking
>whether jobs have completed (JMX stack traces seem to indicate the same
>thing), so it would almost seem as if there is some livelock being hit
>where job callbacks are not able to be processed.
>
>We have a workaround at the moment, but it obviously only sidesteps a
>virtual limitation rather than an external resource limitation, so I
>would love to know if this can be fixed. Logs, stack traces, oozie-site
>properties and pretty much anything else can be provided if need be to
>help iron out what is going on.
>
>--
>Matt
