The input splits are computed in the client. Each map process just opens the input file and seeks to the appropriate offset. From there it reads the entries one at a time and passes each one to the map task. The output of the map task is placed in a buffer. When the buffer gets close to full, the data is sorted and spilled to disk while the map task keeps running. It is hard to attribute CPU time to the individual phases because they all happen in parallel. If you have enough RAM to hold the entire output in memory, and you have configured your sort buffer to be large enough to hold it all, then you will probably only sort/spill once.
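
To make the "only sort/spill once" point concrete, here is a rough back-of-the-envelope sketch in Python. It is not Hadoop code: `estimate_spills` is a hypothetical helper, and the model assumes a spill is triggered each time the buffer reaches a fixed fraction of its size, ignoring per-record metadata overhead. The default values mirror what I believe are the Hadoop 2.x defaults for `mapreduce.task.io.sort.mb` (100) and `mapreduce.map.sort.spill.percent` (0.80); older releases call these `io.sort.mb` and `io.sort.spill.percent`.

```python
import math


def estimate_spills(map_output_mb, sort_buffer_mb=100, spill_percent=0.80):
    """Roughly estimate how many times a map task sorts and spills.

    Simplified model (an assumption, not Hadoop's exact accounting):
    a spill starts whenever the buffered output reaches
    sort_buffer_mb * spill_percent, and any remaining output is
    written out in one final spill when the map task finishes.
    """
    if map_output_mb <= 0:
        return 0  # nothing was emitted, so nothing is spilled
    threshold_mb = sort_buffer_mb * spill_percent
    # Output that never reaches the threshold still gets one final spill.
    return max(1, math.ceil(map_output_mb / threshold_mb))


if __name__ == "__main__":
    # Hypothetical map output sizes, in MB, against the default buffer.
    for output_mb in (50, 500):
        print(output_mb, "MB ->", estimate_spills(output_mb), "spill(s)")
    # The same 500 MB of output against a 1 GB sort buffer.
    print("500 MB, 1024 MB buffer ->", estimate_spills(500, sort_buffer_mb=1024))
```

Under this model, raising the sort buffer until the spill threshold exceeds the map output is what brings the count down to a single sort/spill.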

--Bobby

On 5/8/13 10:25 AM, "牛兆捷" <[email protected]> wrote:

>I looked at the application container log to trace the map-reduce application.
>
>For the map task, I find there are mainly 3 phases: split input, sort, and
>spill out.
>I set enough memory to make sure the input can stay in memory.
>
>Initially, I thought the highest CPU utilization would appear in the sort
>phase because the other two phases focus on IO. However, it doesn't behave
>as I thought. On the contrary, the CPU utilization during the other phases
>is higher.
>
>Does anyone know the reason?
>
>--
>*Sincerely,*
>*Zhaojie*
