I am really not sure what is happening. Try profiling your task.

--Bobby

On 5/8/13 11:48 AM, "牛兆捷" <[email protected]> wrote:

>Just for simplicity, I ran only one map task on about 256 MB of input, then
>set my io.sort.memory to more than 512 MB to make sure all the input can stay
>in memory. I also checked the log to make sure there is just one spill, for
>flushing.
>
>So I think the different parts run one by one, but the CPU utilization is
>not what I expected.
>
>
>2013/5/9 牛兆捷 <[email protected]>
>
>> I have enough memory, so there will be only one sort and spill. Why would
>> they happen in parallel?
>>
>>
>> 2013/5/9 Robert Evans <[email protected]>
>>
>>> Yes, it all happens in parallel, even on a single task.
>>>
>>> On 5/8/13 11:17 AM, "牛兆捷" <[email protected]> wrote:
>>>
>>> >I forgot to say: to see the behavior of a single task, I just ran one
>>> >map task on a 1 GB input split (I set the block size to 1 GB).
>>> >
>>> >
>>> >2013/5/9 Robert Evans <[email protected]>
>>> >
>>> >> Deciding on the input split happens in the client. Each map process
>>> >> just opens up the input file and seeks to the appropriate offset in
>>> >> the file. At that point it reads each entry one at a time and sends it
>>> >> to the map task. The output of the map task is placed in a buffer.
>>> >> When the buffer gets close to full, the data is sorted and spilled out
>>> >> to disk in parallel with the map task still running. It is hard to get
>>> >> CPU time for the different parts because they are all happening in
>>> >> parallel. If you do have enough RAM to store the entire output in
>>> >> memory and you have configured your sort buffer to be able to hold it
>>> >> all, then you will probably only sort/spill once.
>>> >>
>>> >> --Bobby
>>> >>
>>> >> On 5/8/13 10:25 AM, "牛兆捷" <[email protected]> wrote:
>>> >>
>>> >> >I looked at the application container log to trace the map-reduce
>>> >> >application.
>>> >> >
>>> >> >For the map task, I find there are mainly 3 phases: split input, sort,
>>> >> >and spill out.
>>> >> >I set enough memory to make sure the input can stay in memory.
>>> >> >
>>> >> >Initially, I thought the highest CPU utilization would appear in the
>>> >> >sort phase, because the other two phases focus on I/O. However, it
>>> >> >doesn't behave as I thought. On the contrary, the CPU utilization
>>> >> >during the other phases is higher.
>>> >> >
>>> >> >Does anyone know the reason?
>>> >> >
>>> >> >--
>>> >> >Sincerely,
>>> >> >Zhaojie
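
For anyone reading this thread later, the buffer-then-spill behavior described in the quoted explanation above can be sketched as a toy model. This is only an illustration, not Hadoop's actual implementation: the function name `run_map_with_spills`, the `buffer_limit` parameter, and the in-memory `spills` list standing in for spill files are all made up for the example. A "map" loop fills a buffer while a separate thread sorts and "spills" each full buffer, so the two kinds of work overlap in time.

```python
import queue
import threading


def run_map_with_spills(records, buffer_limit):
    """Toy model (hypothetical, not Hadoop code): map fills a buffer,
    a background thread sorts and spills each full buffer."""
    handoff = queue.Queue()  # full buffers waiting to be sorted and spilled
    spills = []              # stands in for spill files on disk

    def spiller():
        while True:
            chunk = handoff.get()
            if chunk is None:            # sentinel: the map side is done
                break
            spills.append(sorted(chunk))  # sort, then "write" the spill

    worker = threading.Thread(target=spiller)
    worker.start()

    buffer = []
    for rec in records:                  # the "map" loop keeps running
        buffer.append(rec)
        if len(buffer) >= buffer_limit:  # buffer full: hand it off
            handoff.put(buffer)
            buffer = []
    if buffer:                           # final flush of the remainder
        handoff.put(buffer)
    handoff.put(None)
    worker.join()
    return spills


if __name__ == "__main__":
    # Small buffer: several spills, sorted while the map loop continues.
    print(run_map_with_spills([5, 3, 8, 1, 9, 2], buffer_limit=4))
    # Buffer larger than the input: a single sort/spill at the end.
    print(run_map_with_spills([5, 3, 8, 1, 9, 2], buffer_limit=100))
```

When the buffer is large enough to hold everything, the model produces exactly one sorted spill, which matches the "you will probably only sort/spill once" case; with a smaller buffer the sorting overlaps with the map loop, which is why per-phase CPU time is hard to separate.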
