Hi all, I'm greatly confused about the spill/sort/merge thing going on during the Map phase.
Here are some stats:

- io.sort.mb = 256 MB (80% spill threshold)
- io.sort.factor = 64
- spills performed during Map: 117
- number of reducers: 96

Now I'm having real trouble understanding the following log output:

...
mapred.Merger: Merging 117 sorted segments
mapred.Merger: Down to the last merge-pass, with 0 segments left of total size: 0 bytes
...
mapred.Merger: Merging 117 sorted segments
mapred.Merger: Merging 54 intermediate segments out of a total of 56
mapred.Merger: Down to the last merge-pass, with 3 segments left of total size: 67119046 bytes
...
mapred.Merger: Merging 117 sorted segments
mapred.Merger: Merging 54 intermediate segments out of a total of 117
mapred.Merger: Down to the last merge-pass, with 64 segments left of total size: 1609011189 bytes
...

My questions:

- What exactly is a segment? Does the number of segments correspond to the number of spills?
- What does "0 segments left" mean? Does it mean that the merge could be performed on the first pass?
- Why are only 54 segments merged instead of "io.sort.factor" segments? (io.sort.factor determines the number of files to merge during a pass, right?)
- Why is the merge performed "number of reducers" times? (I count the line "Merging 117 sorted segments" exactly 96 times.)

Thanks a lot!
Martin
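P.S. To make the numbers easier to discuss, here is a small sketch of the arithmetic I see in the log. Merging k segments into one leaves (total - k + 1) segments, which matches both "64 segments left" (117 - 54 + 1) and "3 segments left" (56 - 54 + 1). The first-pass rule in `first_pass_size` is only my guess at how the merger picks 54, chosen so that the last pass merges exactly io.sort.factor segments; I have not confirmed it against the Hadoop source, and both function names are made up for illustration.

```python
def first_pass_size(num_segments: int, factor: int) -> int:
    """Guessed rule (an assumption, not taken from Hadoop docs):
    size the first pass so later passes can merge `factor` segments each."""
    if num_segments <= factor or factor == 1:
        # Everything fits into a single pass.
        return min(num_segments, factor)
    mod = (num_segments - 1) % (factor - 1)
    return factor if mod == 0 else mod + 1


def segments_after_pass(total: int, merged: int) -> int:
    """Merging `merged` segments into one leaves total - merged + 1."""
    return total - merged + 1


# Values from my job: 117 spills, io.sort.factor = 64.
first = first_pass_size(117, 64)
print(first, segments_after_pass(117, first), segments_after_pass(56, first))
```

If this guess is right, it would explain the 54, but I'd still appreciate confirmation of what is actually going on.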
