Hi,
While developing ETLs on the Hadoop platform, one question has come up: why doesn't Hadoop provide a round robin partitioner?
From my experience, it would be a very useful option when the set of distinct key values is small and limited, and it would balance the ETL resources better. Here is what I mean:
1) Sometimes an ETL job has only a small number of keys, for example data partitioned by date or by hour. In every ETL load there is a very limited count of unique key values (maybe 10 if I load 10 days of data, or 24 if I load one day of data and use the hour as the key).
2) The HashPartitioner is good, since it spreads keys across partitions fairly evenly, provided there is a large number of distinct keys.
3) A lot of the time I have enough spare reducers, but because hashCode() happens to map several keys to the same partition, all the data for those keys goes to the same reducer process. That is not very efficient, as some spare reducers end up with nothing to do.
4) Of course I can implement my own partitioner to control this, but I wonder whether it would really be that hard to provide a general round robin partitioner that distributes the distinct keys evenly across the available reducers (a rough sketch follows below). Of course, as the count of distinct keys grows, the performance of such a partitioner would degrade badly. But if we know the count of distinct keys is small enough, this kind of partitioner would be a good option, right?
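
To make the idea concrete, here is a minimal sketch of what I have in mind (the class name RoundRobinKeyPartitioner is just my own name for illustration): hand each previously unseen key the next reducer in a cycle, so a small key set spreads out evenly. One caveat is that each map task gets its own partitioner instance, so the key-to-partition assignment is only consistent across mappers if the keys appear in the same order everywhere or are known up front; this is only a sketch of the idea, not production code:

    import java.util.HashMap;
    import java.util.Map;
    import org.apache.hadoop.mapreduce.Partitioner;

    // Sketch only: assign each new distinct key to the next partition in a
    // cycle. The map grows with the number of distinct keys, which is why
    // this only makes sense when that count is small.
    public class RoundRobinKeyPartitioner<K, V> extends Partitioner<K, V> {
        private final Map<K, Integer> assigned = new HashMap<K, Integer>();
        private int next = 0;

        @Override
        public int getPartition(K key, V value, int numPartitions) {
            Integer partition = assigned.get(key);
            if (partition == null) {
                // First time we see this key: give it the next reducer in the cycle.
                partition = next;
                next = (next + 1) % numPartitions;
                assigned.put(key, partition);
            }
            return partition;
        }
    }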
Thanks
Yong
