I'll give that a try in the morning. Thanks.

On Mon, Jul 27, 2015, 6:02 PM Josh Wills <[email protected]> wrote:
> Hey David,
>
> The easiest way is to insert a PCollection.cache() call at the stage
> between the two joins where you think the reduce phase should end and the
> next map phase should begin. When the Crunch planner makes the decision of
> where to split the work between a reducer/mapper, it tries to respect any
> explicit cache() calls that it encounters.
>
> Josh
>
> On Mon, Jul 27, 2015 at 2:58 PM, David Ortiz <[email protected]> wrote:
>
>> Hey,
>>
>> Are there any easy tricks to force a new map stage to kick off? I know
>> I can force a reduce with GBK operations, but I am running into an issue
>> where one of our jobs is suffering from data skew. From what I can tell,
>> a couple of hot keys join properly, but when the follow-up processing that
>> precedes the next join runs, the reducer hits the GC Overhead Limit. Based
>> on the dot file, the planner is trying to do all the preprocessing for the
>> next join inside the reducer of the first join, but it could easily do it
>> in the map phase before the next join in the pipeline without any issues,
>> and I think that would also get us past the memory problem. The only
>> workaround I could think of so far is to do everything up to the first
>> join, call pipeline.done(), and then add the remaining operations before
>> another pipeline.done() call.
>>
>> Thanks,
>>
>> Dave
>
> --
> Director of Data Science
> Cloudera <http://www.cloudera.com>
> Twitter: @josh_wills <http://twitter.com/josh_wills>
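For anyone following along, Josh's suggestion might look something like the sketch below. This is a minimal illustration, not code from the thread: the table names (`orders`, `users`, `events`) and value types are hypothetical, and it assumes a Crunch `Pipeline` has already been created. The key point is the `cache()` call placed after the first join's post-processing, which gives the planner a hint to end the reduce phase there and start the next map phase.

```java
// Assumed imports from the Apache Crunch library (not self-contained;
// requires crunch-core on the classpath and a Hadoop runtime).
import org.apache.crunch.PTable;
import org.apache.crunch.lib.Join;

// Hypothetical tables read earlier in the pipeline.
PTable<String, String> orders = /* ... */ null;
PTable<String, String> users  = /* ... */ null;
PTable<String, String> events = /* ... */ null;

// First join: this is where the hot keys land on a single reducer.
PTable<String, String> joined = Join.join(orders, users)
    .parallelDo("post-join prep", new PrepFn(), prepType)
    // Hint to the planner: materialize here, so the preprocessing
    // for the next join runs in a fresh map phase rather than inside
    // the first join's reducer.
    .cache();

// Second join now starts from the cached collection.
PTable<String, String> result = Join.join(joined, events);
```

Here `PrepFn` and `prepType` stand in for whatever `DoFn` and `PType` the real preprocessing uses. Compared with the `pipeline.done()` workaround from the original message, `cache()` keeps everything in a single pipeline run while still influencing where the planner splits map and reduce work.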
