How to control shuffle map output to write to disk or stay in memory?

Maria Thu, 14 Apr 2016 09:29:05 -0700

Hi, all:
   I have several questions about the Tez shuffle stage:
1) How should I understand "pipelined shuffle"? Is it named that because of
the pipelined sort? I found a comment about pipelined shuffle in
ShuffleScheduler.copySucceeded(), but I still cannot fully understand it:
      * In case of pipelined shuffle, it is quite possible that fetchers
      * pulled the FINAL_UPDATE spill in advance due to smaller output size.
      * In such scenarios, we need to wait until we retrieve all spill
      * details to claim success.
Could you please explain what this means in more detail?


2) Are there any other shuffle modes besides pipelined shuffle, such as the
legacy MapReduce shuffle? (I know that Tez borrows much of the MR shuffle.)
3) Where is the map output data stored? How can I control where it is stored?
Are there any parameters for that?
4) If the map output is stored in memory, how do custom vertices and tasks
fetch it from memory? And if we do not reuse containers, who manages the map
outputs?
5) Does one fetcher correspond to one map output? And does a fetcher pull all
the data produced by one map output in a single fetch?

Any reply will be much appreciated.

Maria~.
