>> What exactly is a segment? Is it the number of spills?
> A segment in this context is a fraction of spill output for a particular reduce. Each spill contains a segment for every reduce.

Ah, alright. But why is Hadoop telling me that there are 117 segments given that only 96 reducers have been configured? (btw, I'm using Hadoop 1.0.0)

>> Why are only 54 segments merged instead of "io.sort.factor" segments? (io.sort.factor determines the number of files to merge during a pass, right?)
> The intermediate merge of 54 files to 1 reduces the number of files to 117 - 53 = 64 segments. The final merge is over 64 segments.

Ok, that makes sense.

>> Why is the merge performed "number of reducers" times? (I'm counting the phrase "Merging 117 segments" exactly 96 times)
> Each invocation of the merger is combining all the output assigned to a reduce by the partitioner.

So the merger is called "number of reducers" times because it combines the data for a particular reducer which is spread over all spill files, right?

Martin

On Mon, Sep 17, 2012 at 10:21 AM, Chris Douglas <[email protected]> wrote:
> On Thu, Sep 13, 2012 at 7:04 AM, Martin Dobmeier
> <[email protected]> wrote:
> > What exactly is a segment? Is it the number of spills?
>
> A segment in this context is a fraction of spill output for a
> particular reduce. Each spill contains a segment for every reduce.
>
> > What does "0 segments left" mean? Does it mean that the merge could be
> > performed on the first pass?
> > Why are only 54 segments merged instead of "io.sort.factor" segments?
>
> The intermediate merge of 54 files to 1 reduces the number of files to
> 117 - 53 = 64 segments. The final merge is over 64 segments.
>
> > (io.sort.factor determines the number of files to merge during a pass,
> > right?)
> > Why is the merge performed "number of reducers" times? (I'm counting the
> > phrase "Merging 117 segments" exactly 96 times)
>
> Each invocation of the merger is combining all the output assigned to
> a reduce by the partitioner. -C
>
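
For reference, the 54-then-64 arithmetic discussed above can be sketched in a few lines of Python. This is only a rough model, not Hadoop's actual code: it assumes io.sort.factor = 64 (the thread never states the value; 64 is simply the value consistent with the numbers quoted), and it assumes the first pass is deliberately narrowed so that every later pass can merge exactly io.sort.factor segments, which is how I recall the Hadoop 1.x merger behaving. The function names are made up for illustration.

```python
def first_pass_width(factor, num_segments):
    """Segments merged in the first pass (assumed rule, not copied from Hadoop)."""
    mod = (num_segments - 1) % (factor - 1)
    # If the counts already line up, merge a full `factor`; otherwise merge
    # just enough that later passes can each take exactly `factor` segments.
    return factor if mod == 0 else mod + 1


def merge_passes(factor, num_segments):
    """Return how many segments each merge pass combines, ending with the final merge."""
    if factor < 2:
        raise ValueError("factor must be at least 2")
    widths = []
    remaining = num_segments
    first = True
    while remaining > factor:
        width = first_pass_width(factor, remaining) if first else factor
        first = False
        # `width` segments are replaced by one merged segment.
        remaining = remaining - width + 1
        widths.append(width)
    widths.append(remaining)  # the final merge covers whatever is left
    return widths


# Hypothetical setting: io.sort.factor = 64, 117 segments for one reduce.
print(merge_passes(64, 117))
```

With these assumed inputs the sketch gives an intermediate merge of 54 segments followed by a final merge over 117 - 53 = 64 segments, matching the numbers in the reply above. The real merger also considers segment sizes when picking what to merge first, which this model ignores.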
