Great questions. I am also looking forward to answers from the experts here.
2013/7/16 Felix.徐 <[email protected]>

> Hi all,
>
> I am trying to understand the Collect, Spill and Merge process in Map.
> I've referred to some documentation but still have a few questions.
>
> Here is my understanding of the spill phase in map:
>
> 1. The Collect function adds a record to the buffer.
> 2. If the buffer exceeds a threshold (determined by parameters like
>    io.sort.mb), the spill phase begins.
> 3. The spill phase includes 3 actions: sort, combine and compression.
> 4. Spill may be performed multiple times, so several spill files will be
>    generated.
> 5. If there is more than one spill file, the Merge phase begins and merges
>    these files into one big file.
>
> If I have misunderstood anything about these phases, please correct me.
> Thanks!
>
> And my questions are:
>
> 1. Where is the partition calculated (in Collect or Spill)? Does
>    Collect simply append a record to the buffer and check whether the
>    buffer should be spilled?
>
> 2. In the Merge phase, since the spill files are compressed, does it need
>    to decompress these files and compress them again? Since Merge may be
>    performed in more than one round, does it compress the intermediate
>    files?
>
> 3. Is the Merge phase on the Map side and the Reduce side almost the same
>    (external merge sort combined with a min-heap)?
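
To make the five steps in the quoted message easier to discuss, here is a toy model of them in Java. It only restates the understanding given in the question and is not Hadoop's actual code. The class name SpillSketch, its methods, and the record-count threshold (a stand-in for the io.sort.mb-derived limit) are all made up for illustration. Partitioning, combine and compression are left out, and the "spill files" are just in-memory sorted lists.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.PriorityQueue;

// Toy model only: NOT Hadoop's map output buffer. All names are hypothetical.
class SpillSketch {
    private final int threshold;  // assumed stand-in for the io.sort.mb-derived limit
    private final List<Integer> buffer = new ArrayList<>();
    // Each inner list stands in for one sorted spill file on disk.
    private final List<List<Integer>> spills = new ArrayList<>();

    SpillSketch(int threshold) {
        this.threshold = threshold;
    }

    // Steps 1-2: append a record, then spill once the buffer reaches the threshold.
    void collect(int key) {
        buffer.add(key);
        if (buffer.size() >= threshold) {
            spill();
        }
    }

    // Steps 3-4: sort the buffered records and keep them as one sorted run.
    // (Combine and compression from step 3 are omitted in this sketch.)
    private void spill() {
        if (buffer.isEmpty()) {
            return;
        }
        List<Integer> run = new ArrayList<>(buffer);
        Collections.sort(run);
        spills.add(run);
        buffer.clear();
    }

    int spillCount() {
        return spills.size();
    }

    // Step 5: spill whatever is left, then k-way merge all runs with a min-heap.
    List<Integer> flushAndMerge() {
        spill();
        // Heap entry layout: {value, run index, position within that run}.
        PriorityQueue<int[]> heap =
                new PriorityQueue<>((a, b) -> Integer.compare(a[0], b[0]));
        for (int i = 0; i < spills.size(); i++) {
            heap.add(new int[] {spills.get(i).get(0), i, 0});
        }
        List<Integer> merged = new ArrayList<>();
        while (!heap.isEmpty()) {
            int[] top = heap.poll();
            merged.add(top[0]);
            List<Integer> run = spills.get(top[1]);
            int next = top[2] + 1;
            if (next < run.size()) {
                heap.add(new int[] {run.get(next), top[1], next});
            }
        }
        return merged;
    }

    public static void main(String[] args) {
        SpillSketch s = new SpillSketch(3);
        for (int k : new int[] {5, 1, 4, 2, 8, 7, 3}) {
            s.collect(k);
        }
        System.out.println(s.flushAndMerge()); // prints [1, 2, 3, 4, 5, 7, 8]
        System.out.println(s.spillCount());    // prints 3
    }
}
```

With a threshold of 3 and seven records, the sketch produces three sorted runs and merges them in a single pass. A real external merge would read the runs from disk and might need several rounds when there are many of them, which is what question 2 is about.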
