I am sorry to keep reviving this issue, but even after rolling out this fix and confirming that launchers and actions are routed to separate pools (verified on the FairScheduler page), we are still able to deadlock Oozie after a set number of jobs are submitted. As soon as I run 'hadoop job -kill <workflow-id>' on a number of the active workflow ids, everything just starts working again as if there were no issue. I am now starting to wonder if this problem lies more with the fair scheduler / JobTracker than with Oozie, but overall I am running out of ideas.
We are currently running about 32 pools in our fair scheduler config, and the general statistics are below:
- Our total capacity is roughly 250+ mappers and 100+ reducers
- Most pools have the default weight, 2 min mappers, and 84 max mappers
- The Oozie action pool has a weight of 4, 100 min mappers, 200 max mappers, and 200 max concurrent jobs
- The Oozie launcher pool has a weight of 2, 100 min mappers, 200 max mappers, and 200 max concurrent jobs

Does anyone see any issues with this setup? Is there any reason why, given this config, neither of those pools can hit the total cap specified? Thank you again for any suggestions, and as always, if you want any more detailed information (logs, workflow descriptions, etc.), I am more than happy to provide it. -- Matt
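P.S. For reference, here is roughly what the two Oozie pool definitions look like in our allocation file. This is only a sketch: the pool names ("oozie-actions", "oozie-launchers") are placeholders, and the exact element names (minMaps, maxMaps, maxRunningJobs) may differ depending on your Hadoop / fair scheduler version.

```xml
<?xml version="1.0"?>
<!-- Sketch of the fair-scheduler allocation file described above.
     Pool names are illustrative; element names may vary by version. -->
<allocations>
  <pool name="oozie-actions">
    <weight>4</weight>
    <minMaps>100</minMaps>
    <maxMaps>200</maxMaps>
    <maxRunningJobs>200</maxRunningJobs>
  </pool>
  <pool name="oozie-launchers">
    <weight>2</weight>
    <minMaps>100</minMaps>
    <maxMaps>200</maxMaps>
    <maxRunningJobs>200</maxRunningJobs>
  </pool>
  <!-- plus ~30 other pools with default weight, 2 min mappers,
       and 84 max mappers, as described above -->
</allocations>
```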
