Yes, it all happens in parallel, even on a single task.

On 5/8/13 11:17 AM, "牛兆捷" <[email protected]> wrote:
>I forgot to say: to see the behavior of a single task, I just run one map
>task on a 1 GB input split (I set the block size to 1 GB).
>
>
>2013/5/9 Robert Evans <[email protected]>
>
>> Deciding on the input split happens in the client. Each map process just
>> opens up the input file and seeks to the appropriate offset in the file.
>> At that point it reads each entry one at a time and sends it to the map
>> task. The output of the map task is placed in a buffer. When the buffer
>> gets close to full, the data is sorted and spilled out to disk in parallel
>> with the map task still running. It is hard to get CPU time for the
>> different parts because they are all happening in parallel. If you do have
>> enough RAM to store the entire output in memory, and you have configured
>> your sort buffer to be able to hold it all, then you will probably only
>> sort/spill once.
>>
>> --Bobby
>>
>> On 5/8/13 10:25 AM, "牛兆捷" <[email protected]> wrote:
>>
>> >I traced the map-reduce application through the application container
>> >log.
>> >
>> >For a map task, I find there are mainly three phases: split input, sort,
>> >and spill out.
>> >I set enough memory to make sure the input can stay in memory.
>> >
>> >Initially, I thought the highest CPU utilization would appear in the
>> >sort phase, because the other two phases focus on I/O; however, it does
>> >not behave the way I expected. On the contrary, the CPU utilization
>> >during the other phases is higher.
>> >
>> >Anyone know the reason?
>> >
>> >--
>> >*Sincerely,*
>> >*Zhaojie*
>>
>
>
>--
>*Sincerely,*
>*Zhaojie*
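For reference, a minimal sketch of how the sort buffer Bobby mentions might be
sized so the whole map output stays in memory and only one sort/spill happens.
The property names are the Hadoop 2.x ones, and the class name, 512 MB buffer,
and heap size are made-up illustrative values, not settings taken from this
thread:

    // Hypothetical job setup: size the map-side sort buffer so the entire
    // map output fits in memory and is sorted/spilled only once.
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.Job;

    public class SingleSpillJob {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();

            // In-memory buffer (MB) that collects map output before it is
            // sorted and spilled; make it large enough for the whole output.
            conf.setInt("mapreduce.task.io.sort.mb", 512);

            // Fraction of the buffer that must fill before a background
            // spill starts while the map task keeps running.
            conf.setFloat("mapreduce.map.sort.spill.percent", 0.90f);

            // The map JVM needs enough heap to hold that buffer.
            conf.set("mapreduce.map.java.opts", "-Xmx1024m");

            Job job = Job.getInstance(conf, "single-spill-map");
            // ... set mapper class, input/output paths, etc., then
            // job.waitForCompletion(true);
        }
    }

If the buffer is smaller than the map output, it fills while the map is still
running and background threads sort and spill partial output in parallel,
which is why the CPU cost of sorting is spread across the whole map phase
rather than showing up as a separate spike.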
