Thanks~
2013/5/9 Robert Evans <[email protected]>

> I am really not sure what is happening. Try profiling your task.
>
> --Bobby
>
> On 5/8/13 11:48 AM, "牛兆捷" <[email protected]> wrote:
>
>> Just for simplicity, I run only one map task on an input of, say,
>> 256 MB, and I set my io.sort.memory to more than 512 MB to make sure
>> all the input can stay in memory. I also checked the log to make sure
>> there is just one spill, for flushing.
>>
>> So I think the different parts run one by one, but the CPU utilization
>> is not what I expected.
>>
>> 2013/5/9 牛兆捷 <[email protected]>
>>
>>> I have enough memory, so there will be only one sort and spill. Why
>>> would they happen in parallel?
>>>
>>> 2013/5/9 Robert Evans <[email protected]>
>>>
>>>> Yes, it all happens in parallel, even on a single task.
>>>>
>>>> On 5/8/13 11:17 AM, "牛兆捷" <[email protected]> wrote:
>>>>
>>>>> I forgot to say: to see the behavior of a single task, I run just
>>>>> one map task on a 1 GB input split (I set the block size to 1 GB).
>>>>>
>>>>> 2013/5/9 Robert Evans <[email protected]>
>>>>>
>>>>>> Deciding on the input split happens in the client. Each map
>>>>>> process just opens up the input file and seeks to the appropriate
>>>>>> offset in the file. At that point it reads each entry one at a
>>>>>> time and sends it to the map task. The output of the map task is
>>>>>> placed in a buffer. When the buffer gets close to full, the data
>>>>>> is sorted and spilled out to disk in parallel with the map task
>>>>>> still running. It is hard to get CPU time for the different parts
>>>>>> because they are all happening in parallel. If you do have enough
>>>>>> RAM to store the entire output in memory, and you have configured
>>>>>> your sort buffer to be able to hold it all, then you will
>>>>>> probably only sort/spill once.
>>>>>>
>>>>>> --Bobby
>>>>>>
>>>>>> On 5/8/13 10:25 AM, "牛兆捷" <[email protected]> wrote:
>>>>>>
>>>>>>> I looked at the application container log to trace the
>>>>>>> map-reduce application.
>>>>>>>
>>>>>>> For the map task, I find there are mainly 3 phases: split input,
>>>>>>> sort, and spill out. I set enough memory to make sure the input
>>>>>>> can stay in memory.
>>>>>>>
>>>>>>> Initially, I thought the highest CPU utilization would appear in
>>>>>>> the sort phase, because the other two phases focus on IO.
>>>>>>> However, it doesn't behave as I thought. On the contrary, the
>>>>>>> CPU utilization during the other phases is higher.
>>>>>>>
>>>>>>> Does anyone know the reason?
>>>>>>>
>>>>>>> --
>>>>>>> *Sincerely,*
>>>>>>> *Zhaojie*

--
*Sincerely,*
*Zhaojie*

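For anyone reading this thread later: the behavior Bobby describes (map output collected in a buffer, then sorted and spilled in the background while the map keeps running) can be sketched with a toy model. This is not Hadoop code, only a simplified illustration in Python. The names `run_map_task`, `buffer_limit`, and `spill_fraction` are made up for this sketch, the buffer counts records rather than bytes, and it starts one thread per spill for brevity, which is an assumption and not how Hadoop schedules spills. It also records per-thread CPU time, which is one way to see why a single process-wide CPU figure is hard to attribute to a phase.

```python
import threading
import time


def run_map_task(records, buffer_limit, spill_fraction=0.8):
    """Toy model: buffer map output, sort and spill it on background threads.

    All names and parameters here are hypothetical; this is not Hadoop's API.
    Returns (spills, cpu) where each spill is one sorted run and cpu holds
    per-thread CPU seconds for the map loop and for the sort/spill work.
    """
    spills = []                              # stands in for spill files on disk
    cpu = {"map": 0.0, "sort_spill": 0.0}
    lock = threading.Lock()
    workers = []

    def sort_and_spill(chunk):
        start = time.thread_time()           # CPU time of this thread only (Python 3.7+)
        run = sorted(chunk)
        elapsed = time.thread_time() - start
        with lock:
            spills.append(run)
            cpu["sort_spill"] += elapsed

    def start_spill(chunk):
        worker = threading.Thread(target=sort_and_spill, args=(chunk,))
        worker.start()
        workers.append(worker)

    # Spill once the buffer is "close to full", not only when it is full.
    threshold = max(1, int(buffer_limit * spill_fraction))
    buffer = []
    start = time.thread_time()
    for record in records:
        buffer.append(record)                # map output goes into the buffer
        if len(buffer) >= threshold:
            start_spill(buffer)              # sorting runs while the loop continues
            buffer = []
    if buffer:
        start_spill(buffer)                  # final flush of whatever is left
    map_elapsed = time.thread_time() - start

    for worker in workers:
        worker.join()
    with lock:
        cpu["map"] = map_elapsed
    return spills, cpu


if __name__ == "__main__":
    spills, cpu = run_map_task(range(100_000, 0, -1), buffer_limit=1_000_000)
    print(len(spills), "spill(s)", cpu)
```

With a buffer large enough to hold all the output, the sketch produces a single spill at the final flush, which matches the "only sort/spill once" case discussed above. With a smaller `buffer_limit`, several spills overlap with the map loop.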