I am sorry to keep reviving this issue, but even after rolling out this fix
and confirming that launchers and actions are routed to separate pools
(verified on the fairscheduler page), we are still able to deadlock Oozie
after a set number of jobs are submitted. As soon as I run 'hadoop job
-kill <workflow-id>' on a set number of the active workflow ids,
everything starts working again as if there were no issue. I am now
starting to wonder if this problem lies more with the fair scheduler /
jobtracker than with Oozie, but overall I am running out of ideas.

We are currently running about 32 pools in our fairscheduler config and the
general statistics are below:
- Our total capacity is roughly 250+ mappers and 100+ reducers
- Most pools have the default weight, 2 min mappers and 84 max mappers
- The Oozie action pool has a weight of 4, 100 min mappers, 200 max mappers
and 200 max concurrent jobs
- The Oozie launcher pool has a weight of 2, 100 min mappers, 200 max
mappers and 200 max concurrent jobs
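For reference, here is a sketch of how the two Oozie pool entries above would look in the fair-scheduler allocation file. The pool names are placeholders (not necessarily what we use), and the element names assume the MR1 fair scheduler allocation format:

```xml
<?xml version="1.0"?>
<!-- Sketch of the two Oozie pools described above.
     Pool names are placeholders; numbers match the list above. -->
<allocations>
  <pool name="oozie-actions">
    <weight>4.0</weight>
    <minMaps>100</minMaps>
    <maxMaps>200</maxMaps>
    <maxRunningJobs>200</maxRunningJobs>
  </pool>
  <pool name="oozie-launchers">
    <weight>2.0</weight>
    <minMaps>100</minMaps>
    <maxMaps>200</maxMaps>
    <maxRunningJobs>200</maxRunningJobs>
  </pool>
</allocations>
```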

Does anyone see any issues with this setup? Is there any reason why, given
this config, neither of those pools can reach its specified cap?

Thank you again for any suggestions and as always if you guys want any more
detailed information (logs, workflow descriptions, etc) I am more than
happy to provide them.

--
Matt
