Hi,
While developing ETLs on the Hadoop platform, one question came up that I want
to ask: why doesn't Hadoop provide a round-robin partitioner?
From my experience, it would be a very powerful option when the keys have a
small, limited set of distinct values, and it would balance the ETL resources.
Here is what I want to say:
1) Sometimes you will have an ETL with a small number of keys, for example data
partitioned by date or by hour. So in every ETL load I will have a very limited
count of unique key values (maybe 10 if I load 10 days of data, or 24 if I load
one day of data and use the hour as the key).

2) The HashPartitioner is good when you have a large number of distinct keys,
since the hash values spread the keys roughly evenly across the partitions.

3) A lot of the time I have enough spare reducers, but because the hashCode()
method happens to send several keys to the same partition, all the data of
those keys goes to the same reducer process. This is not very efficient, as
some spare reducers just happen to get nothing to do (see the small check after
this list).

4) Of course I can implement my own partitioner to control this, but I wonder
whether it would really be too hard to provide a general round-robin
partitioner that distributes the different keys equally across the available
reducers (a rough sketch follows after the list). Of course, as the count of
distinct keys grows, the performance of this kind of partitioner degrades
badly. But if we know the count of distinct keys is small enough, using this
kind of partitioner would be a good option, right?
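
To show point 3, here is a small standalone check I put together as an
illustration. The partition formula is the one the default HashPartitioner
uses; the class name and the two-digit hour keys are just my example. Printing
the assignments makes it easy to see when several hours share a reducer while
other reducers get no key at all:

import org.apache.hadoop.io.Text;

// Quick standalone check of where the default HashPartitioner would send
// 24 hour keys ("00".."23") when there are 24 reducers. The formula below,
// (hashCode & Integer.MAX_VALUE) % numReduceTasks, is the one
// HashPartitioner.getPartition() uses.
public class HashSkewCheck {
  public static void main(String[] args) {
    int numReduceTasks = 24;
    for (int hour = 0; hour < 24; hour++) {
      Text key = new Text(String.format("%02d", hour));
      int partition = (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
      System.out.println("hour " + key + " -> reducer " + partition);
    }
  }
}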
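
And here is roughly the kind of partitioner I have in mind. It is only a
sketch: the class name and the configuration property are made up, and it
assumes the distinct key values can be listed up front (which is true for my
date/hour loads), so that every mapper sends a given key to the same reducer
and the grouping stays correct:

import java.util.HashMap;
import java.util.Map;

import org.apache.hadoop.conf.Configurable;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Spreads a small, known list of distinct keys evenly over the reducers.
// The key list comes from the job configuration (the property name here is
// made up), so every mapper maps a given key to the same partition.
public class KeyListRoundRobinPartitioner extends Partitioner<Text, Text>
    implements Configurable {

  public static final String KEYS_PROPERTY = "roundrobin.partitioner.keys";

  private Configuration conf;
  private final Map<String, Integer> keyIndex = new HashMap<String, Integer>();

  @Override
  public void setConf(Configuration conf) {
    this.conf = conf;
    // e.g. job.getConfiguration().set(KEYS_PROPERTY, "2013-06-01,2013-06-02")
    String[] keys = conf.get(KEYS_PROPERTY, "").split(",");
    for (int i = 0; i < keys.length; i++) {
      keyIndex.put(keys[i], i);
    }
  }

  @Override
  public Configuration getConf() {
    return conf;
  }

  @Override
  public int getPartition(Text key, Text value, int numPartitions) {
    Integer index = keyIndex.get(key.toString());
    if (index == null) {
      // fall back to the usual hash partitioning for unexpected keys
      return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }
    // the i-th distinct key goes to reducer i mod numPartitions
    return index % numPartitions;
  }
}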
Thanks
Yong