Yep, that's right-- can you file a JIRA, and I'll post the patch?

On Mon, Oct 17, 2016 at 10:52 PM, 陈竞 <cj.mag...@gmail.com> wrote:

> I may have found the root cause in my case:
>
>   public void materializeAt(SourceTarget<S> sourceTarget) {
>     this.materializedAt = sourceTarget;
>     this.size = materializedAt.getSize(getPipeline().getConfiguration());
>   }
>
>   @Override
>   public long getSize() {
>     if (size < 0) {
>       this.size = getSizeInternal();
>     }
>     return size;
>   }
>
> PCollectionImpl.materializeAt(sourceTarget) is invoked when a node is
> split to create a temporary table. The sourceTarget is bound to the new
> temporary table, whose size is 0 because its path was just created, so
> this.size is set to 0. Later, when getSize() is invoked while setting the
> number of reducers, size is 0 rather than negative, so getSize() just
> returns 0, which makes the number of reducers too small.
>
> So I think materializeAt() should check the sourceTarget's size, like
> below:
>
>   public void materializeAt(SourceTarget<S> sourceTarget) {
>     this.materializedAt = sourceTarget;
>     long size = materializedAt.getSize(getPipeline().getConfiguration());
>     if (size > 0) {
>       this.size = size;
>     }
>   }
>
> 2016-10-17 11:19 GMT+08:00 David Ortiz <dpo5...@gmail.com>:
>
>> That gets tricky if you have input data that is heavily filtered though.
>> Perhaps play around with the scale factor on operations that may blow up
>> data?
>>
>> On Sun, Oct 16, 2016, 10:04 PM 陈竞 <cj.mag...@gmail.com> wrote:
>>
>>> That's a solution, but since users may not clearly know which step will
>>> produce a temporary table, I think setting the number of reducers
>>> automatically will improve the user experience. Maybe we can set the
>>> number of reducers to 1/3 of the number of mappers before submitting a
>>> job if one of its inputs is a temporary table.
>>>
>>> 2016-10-14 18:59 GMT+08:00 David Ortiz <dpo5...@gmail.com>:
>>>
>>> You can manually set the reducer number using the conf object among
>>> other things.
>>>
>>> On Fri, Oct 14, 2016, 5:43 AM 陈竞 <cj.mag...@gmail.com> wrote:
>>>
>>> Hi, I found that if the pipeline produces a temporary table, the number
>>> of reducers for a temporary table whose input is itself a temporary
>>> table becomes too small, since the temporary table has no content.
>>>
>>> --
>>> 陈竞 (Jing Chen), Institute of Computing Technology, Chinese Academy of
>>> Sciences, High Performance Computer Center
>>> Jing Chen HPCC.ICT.AC China
>>>
>>
>
> --
> 陈竞 (Jing Chen), Institute of Computing Technology, Chinese Academy of
> Sciences, High Performance Computer Center
> Jing Chen HPCC.ICT.AC China
>
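
The caching behavior discussed in this thread can be reproduced in isolation. What follows is only a minimal sketch, not the Crunch code: MaterializeSizeSketch, SketchCollection, the guardZero flag, and the 1,000,000-byte estimate are all hypothetical names and values made up for illustration. The long passed to materializeAt() stands in for materializedAt.getSize(conf), and the LongSupplier stands in for getSizeInternal(). It shows a cached size of 0 masking the lazy estimate, and the size > 0 guard proposed above avoiding that.

```java
import java.util.function.LongSupplier;

/** Hypothetical stand-in for the collection discussed in the thread; not the Crunch class. */
final class MaterializeSizeSketch {

    /** Simplified collection with a cached size; a negative value means "not computed yet". */
    static class SketchCollection {
        private long size = -1L;
        private final LongSupplier sizeInternal; // plays the role of getSizeInternal()
        private final boolean guardZero;         // true = apply the size > 0 check proposed in the thread

        SketchCollection(LongSupplier sizeInternal, boolean guardZero) {
            this.sizeInternal = sizeInternal;
            this.guardZero = guardZero;
        }

        /** reportedSize plays the role of materializedAt.getSize(conf). */
        void materializeAt(long reportedSize) {
            // Without the guard, a freshly created (empty) path overwrites the cache with 0.
            if (!guardZero || reportedSize > 0) {
                this.size = reportedSize;
            }
        }

        long getSize() {
            // Only a negative cache triggers the lazy estimate; a cached 0 is returned as-is.
            if (size < 0) {
                size = sizeInternal.getAsLong();
            }
            return size;
        }
    }

    public static void main(String[] args) {
        LongSupplier estimate = () -> 1_000_000L; // assumed upstream size estimate

        SketchCollection original = new SketchCollection(estimate, false);
        original.materializeAt(0L); // temporary path was just created, so it reports 0

        SketchCollection patched = new SketchCollection(estimate, true);
        patched.materializeAt(0L);

        System.out.println("original getSize(): " + original.getSize());
        System.out.println("patched getSize():  " + patched.getSize());
    }
}
```

In this sketch the unguarded version reports 0 and the guarded version falls back to the estimate. As noted earlier in the thread, an estimate can still be far off when the input is heavily filtered, so the guard only avoids the zero-size case; it does not make the estimate accurate.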