Hey,
Is there an easy trick to force a new map stage to kick off? I know I can
force a reduce with a GBK operation, but one of our jobs is suffering from data
skew. From what I can tell, a couple of hot keys join properly, but the
follow-up processing that comes before the next join then makes the reducer hit
the GC Overhead Limit. Based on the dot file, the job is doing all of the
preprocessing for the next join in the reducer of the first join. That work
could easily run in the map phase before the next join instead, and I think
that would also get us past the memory problem. The only solution I can think
of at the moment is to do everything up to the first join, call
pipeline.done(), and then add the remaining operations before a second
pipeline.done() call.
Thanks,
Dave