For simplicity, I ran only one map task on an input of, for example, 256 MB, and I set io.sort.mb to more than 512 MB so that all of the input can stay in memory. I also checked the log to make sure that only one spill happens, the one for the final flush.
So I think the different parts run one after another, but the CPU utilization is not what I expected.

2013/5/9 牛兆捷 <[email protected]>

> I have enough memory, so there will be only one sort and spill. Why
> would they happen in parallel?
>
> 2013/5/9 Robert Evans <[email protected]>
>
>> Yes, it all happens in parallel, even on a single task.
>>
>> On 5/8/13 11:17 AM, "牛兆捷" <[email protected]> wrote:
>>
>> >I forgot to say: to see the behavior of a single task, I ran just one
>> >map task on a 1 GB input split (I set the block size to 1 GB).
>> >
>> >2013/5/9 Robert Evans <[email protected]>
>> >
>> >> Deciding on the input split happens in the client. Each map process
>> >> just opens up the input file and seeks to the appropriate offset in
>> >> the file. At that point it reads each entry one at a time and sends
>> >> it to the map task. The output of the map task is placed in a
>> >> buffer. When the buffer gets close to full, the data is sorted and
>> >> spilled out to disk in parallel with the map task still running. It
>> >> is hard to get CPU time for the different parts because they are all
>> >> happening in parallel. If you do have enough RAM to store the entire
>> >> output in memory and you have configured your sort buffer to be able
>> >> to hold it all, then you will probably only sort/spill once.
>> >>
>> >> --Bobby
>> >>
>> >> On 5/8/13 10:25 AM, "牛兆捷" <[email protected]> wrote:
>> >>
>> >> >I looked at the application container log to trace the map-reduce
>> >> >application.
>> >> >
>> >> >For a map task, I find there are mainly three phases: split input,
>> >> >sort, and spill out.
>> >> >I set enough memory to make sure the input can stay in memory.
>> >> >
>> >> >Initially, I thought the highest CPU utilization would appear in
>> >> >the sort phase, because the other two phases focus on I/O. However,
>> >> >it doesn't behave as I thought. On the contrary, the CPU
>> >> >utilization during the other phases is higher.
>> >> >
>> >> >Does anyone know the reason?
>> >> >
>> >> >--
>> >> >*Sincerely,*
>> >> >*Zhaojie*
>> >
>> >--
>> >*Sincerely,*
>> >*Zhaojie*
>
> --
> *Sincerely,*
> *Zhaojie*

--
*Sincerely,*
*Zhaojie*
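
For anyone following the thread, here is a rough sketch of the buffering behaviour Bobby describes: map output accumulates in a buffer, and a sort-and-spill pass is triggered whenever the buffer gets close to full, plus one final flush at the end. It is only a toy model, not Hadoop code. The function name `count_spills`, its parameters, and the megabyte units are hypothetical, and the 0.80 threshold is an assumption standing in for the configured spill percentage. The model is also serial, whereas the real spill runs in the background while the map function keeps producing output.

```python
def count_spills(record_sizes_mb, buffer_mb, spill_percent=0.80):
    """Count sort-and-spill passes for one simulated map task.

    Toy, serial model (hypothetical names, sizes in MB): the real map
    task sorts and spills in the background while map() keeps running.
    """
    threshold = buffer_mb * spill_percent  # "close to full" mark (assumed 80%)
    used = 0
    spills = 0
    for size in record_sizes_mb:
        used += size
        if used >= threshold:
            spills += 1  # sort the buffered records and write them to disk
            used = 0
    if used > 0:
        spills += 1      # final flush once the map input is exhausted
    return spills


# 256 MB of map output, written as 256 records of 1 MB each.
output = [1] * 256
print(count_spills(output, buffer_mb=512))  # buffer holds everything
print(count_spills(output, buffer_mb=128))  # buffer fills up several times
```

With the 512 MB buffer the threshold is never reached, so the only spill is the final flush, which matches the single spill seen in the log. With the smaller buffer the model spills several times. What the model cannot show is the overlap in time between reading, mapping, sorting and spilling, which is why CPU time is hard to attribute to one phase.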
